AI FOR SECURITY - PROJECT 2024/2025¶

Group 05:
  • Mattia Campana - 10712872
  • Camilla Foresti - 10739259
  • Filippo Gotta - 10774461

INTRODUCTION¶

The Internet of Medical Things (IoMT) has become increasingly vital in healthcare, enabling continuous patient monitoring and automated medical services. However, this connectivity also introduces cybersecurity risks that could compromise patient care and privacy. The CICIoMT2024 dataset, developed by the Canadian Institute for Cybersecurity (CIC), provides a comprehensive benchmark for evaluating IoMT security solutions.

Dataset Overview¶

  • Contains network traffic from 40 IoMT devices (25 real and 15 simulated)
  • Includes traffic across multiple protocols: Wi-Fi, MQTT, and Bluetooth
  • Features 18 different types of attacks categorized into:
    • DDoS (Distributed Denial of Service)
    • DoS (Denial of Service)
    • Reconnaissance
    • MQTT-specific attacks
    • Spoofing

Project Objectives¶

  1. Analyze a subset of the CICIoMT2024 dataset to develop efficient security models
  2. Implement and evaluate both supervised and unsupervised learning approaches
  3. Compare model performance across different attack categories
  4. Identify the most effective model for real-time threat detection in IoMT networks

Our analysis will focus on the trade-off between model complexity and detection accuracy, aiming to provide practical insights for securing healthcare IoT infrastructure.

LIBRARIES¶

Data Manipulation, Preprocessing and Analysis¶

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.utils import resample
from imblearn.over_sampling import SMOTE

Data Visualisation¶

In [2]:
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec
from sklearn.manifold import TSNE
from tabulate import tabulate
from tqdm import tqdm
from rich import print
from rich.tree import Tree
from rich.panel import Panel

Text Processing¶

In [3]:
import re

Machine Learning¶

Models¶

In [4]:
# Supervised
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb

# Neural Network (supervised)
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Unsupervised

## Clustering (unsupervised)
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
from sklearn.cluster import KMeans, DBSCAN

## Anomaly Detection (unsupervised)
from sklearn.svm import OneClassSVM
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

Optimisation¶

In [5]:
from skopt import BayesSearchCV
from sklearn.model_selection import GridSearchCV, StratifiedKFold

Evaluation¶

In [6]:
from sklearn.metrics import (
    confusion_matrix, ConfusionMatrixDisplay, accuracy_score, precision_score,
    recall_score, f1_score, silhouette_score, calinski_harabasz_score,
    davies_bouldin_score
)
from sklearn.metrics import make_scorer, roc_curve, roc_auc_score
from sklearn.calibration import CalibrationDisplay

System/File Operations¶

In [7]:
import os
from pathlib import Path
import shutil
from typing import Tuple, Dict, Literal, Optional, List
from collections import defaultdict
import time
from joblib import dump, load
import json

SEEDING¶

To make our code reproducible, we set a random seed every time a randomised operation is performed. For no particular reason, we chose the number 37 each time. Wait... maybe there's a reason...

https://youtu.be/d6iQrh2TK98?si=174Z_qsxlW1cdiDd - "Why is this number everywhere?", Veritasium

http://thirty-seven.org - 37 Heaven
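In practice this amounts to a small helper like the one below. The `set_seed` name and the optional torch handling are our own packaging of the idea; in the notebook itself the seed is mostly passed explicitly via `random_state=37` where needed.

```python
import random
import numpy as np

SEED = 37  # the seed used throughout this notebook

def set_seed(seed: int = SEED) -> None:
    """Seed all sources of randomness we rely on."""
    random.seed(seed)
    np.random.seed(seed)
    try:  # torch is only needed for the neural-network section
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass

set_seed()
```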

1. DATASET PREPARATION¶

Download Dataset¶

Create the directory and download the whole CSV dataset through wget

mkdir dataset
cd dataset/ && wget -c -r -np -nH --cut-dirs=3 -R "index.html*" http://205.174.165.80/IOTDataset/CICIoMT2024/Dataset/WiFI_and_MQTT/attacks/CSV/

Analyse Dataset¶

The dataset consists of network traffic files organized by protocol and specific attack type. Each row summarises a group of 10 to 100 traffic packets, so some features describe not a single event but aggregate statistics computed over the whole group.

Because the data are unevenly distributed across files, the first step was to analyze the volume of samples. This initial assessment was crucial, as significant imbalances in data volume between attack types or protocols could skew our analysis.

Understanding distributions allowed us to create a balanced and representative subset of data for our analysis while ensuring computational efficiency.

In this phase, we explore the directory and provide a detailed analysis of its structure and content. Note that files are named after the protocol and the attack type; in some cases multiple files cover the same protocol and attack type, so they are numbered starting from zero. To avoid imbalance in data sampling, the following code groups files by their unique base names. In this way we obtain 19 files: 18 about attacks and 1 about benign data.

In [38]:
def get_size(start_path: str) -> int:
    return sum(os.path.getsize(os.path.join(dirpath, f))
              for dirpath, _, filenames in os.walk(start_path)
              for f in filenames)

def explore_subdirectory(path: str, level: int = 0) -> Dict:
    contents = os.listdir(path)
    dirs = [d for d in contents if os.path.isdir(os.path.join(path, d))]
    files = [f for f in contents if os.path.isfile(os.path.join(path, f))]
    
    stats = {
        'directory': os.path.basename(path),
        'num_dirs': len(dirs),
        'num_files': len(files),
        'size_mb': get_size(path) / (1024 * 1024),
        'unique_bases': len(set(re.sub(r'\d+', '', f) for f in files))
    }
    
    tree = Tree(f"📁 {stats['directory']}")
    tree.add(f"Directories: {stats['num_dirs']}")
    tree.add(f"Files: {stats['num_files']}")
    tree.add(f"Size: {stats['size_mb']:.2f} MB")
    tree.add(f"Unique base names: {stats['unique_bases']}")
    
    print(Panel(tree, border_style="black"))
    
    for subdir in dirs:
        explore_subdirectory(os.path.join(path, subdir), level + 1)
        
    return stats
In [39]:
_ = explore_subdirectory('dataset/original')
╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ 📁 original                                                                                                     │
│ ├── Directories: 2                                                                                              │
│ ├── Files: 0                                                                                                    │
│ ├── Size: 2171.92 MB                                                                                            │
│ └── Unique base names: 0                                                                                        │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ 📁 test_og                                                                                                      │
│ ├── Directories: 0                                                                                              │
│ ├── Files: 21                                                                                                   │
│ ├── Size: 393.78 MB                                                                                             │
│ └── Unique base names: 19                                                                                       │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ 📁 train_og                                                                                                     │
│ ├── Directories: 0                                                                                              │
│ ├── Files: 51                                                                                                   │
│ ├── Size: 1778.14 MB                                                                                            │
│ └── Unique base names: 19                                                                                       │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

Given the statistics above, we can merge the CSV files by base name, reading from 'train_og' for training data and 'test_og' for testing data. The code below identifies all CSV files in a directory and groups them by base name. For each group containing more than one file, it reads the files, concatenates them row-wise into a single DataFrame, and saves the result under the base name.

In [40]:
def merge_files_by_base_name(directory: str) -> None:
    in_dir = os.path.join('dataset/original', directory + '_og')
    out_dir = os.path.join('dataset/merged', directory)
    
    if not os.path.exists(out_dir):
        os.makedirs(out_dir)

    # Get all files in the directory
    files = [f for f in os.listdir(in_dir) if f.endswith('.csv')]
    
    # Group files by base name (excluding trailing numbers)
    grouped_files = defaultdict(list)
    for file in files:
        base_name = re.sub(r'\d+', '', file)
        grouped_files[base_name].append(file)
    
    # Merge files in each group
    for base_name, file_list in grouped_files.items():
        if len(file_list) <= 1:
            # Single file: copy it across unchanged
            src_file = os.path.join(in_dir, file_list[0])
            dest_file = os.path.join(out_dir, f"{base_name}")
            shutil.copy(src_file, dest_file)
            continue

        dfs = []

        for file in file_list:
            file_path = os.path.join(in_dir, file)
            df = pd.read_csv(file_path)
            dfs.append(df)
        
        # Concatenate all dataframes in the group
        merged_df = pd.concat(dfs, ignore_index=True)
        
        # Save the merged dataframe
        output_file = os.path.join(out_dir, f"{base_name}")
        merged_df.to_csv(output_file, index=False)
In [41]:
merge_files_by_base_name('train')
merge_files_by_base_name('test')
In [42]:
_ = explore_subdirectory('dataset/merged')
╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ 📁 merged                                                                                                       │
│ ├── Directories: 2                                                                                              │
│ ├── Files: 0                                                                                                    │
│ ├── Size: 2171.90 MB                                                                                            │
│ └── Unique base names: 0                                                                                        │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ 📁 train                                                                                                        │
│ ├── Directories: 0                                                                                              │
│ ├── Files: 19                                                                                                   │
│ ├── Size: 1778.12 MB                                                                                            │
│ └── Unique base names: 19                                                                                       │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ 📁 test                                                                                                         │
│ ├── Directories: 0                                                                                              │
│ ├── Files: 19                                                                                                   │
│ ├── Size: 393.78 MB                                                                                             │
│ └── Unique base names: 19                                                                                       │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

File Length Analysis¶

The following code analyzes and visualizes the data distribution across files, applying a size-based sampling scheme:

  • Files with fewer than 20k rows, and all benign files: retained entirely
  • Files with 20k-100k rows: keep 20k rows plus 20% of the remaining rows
  • Files with more than 100k rows: keep 20k rows, plus 20% of the next 80k rows, plus 10% of everything beyond 100k

This stratified sampling approach helps maintain dataset balance while managing computational resources. The visualization uses color coding (green/orange/red) to highlight size thresholds.
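As a quick sanity check of these rules, the helper below computes the expected subset size for a file of a given length. The function is purely illustrative (it mirrors the thresholds above); the actual sampling is performed later by the subset-generation code.

```python
def expected_subset_size(n_rows: int, cap: int = 20_000) -> int:
    """Rows kept under the size-based sampling rules (illustrative only)."""
    if n_rows < cap:
        return n_rows                      # small files are kept entirely
    if n_rows < 100_000:
        return cap + round(0.2 * (n_rows - cap))
    # large files: 20% of the next 80k rows, then 10% of the remainder
    return cap + round(0.2 * (100_000 - cap)) + round(0.1 * (n_rows - 100_000))

print(expected_subset_size(150_000))  # 20000 + 16000 + 5000 = 41000
```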

In [43]:
PLOT_CONFIG = {
    'thresholds': {'small': 20000, 'medium': 100000},
    'colors': {'small': 'green', 'medium': 'orange', 'large': 'red'},
    'base_path': 'dataset'
}

def get_color_by_length(length: int) -> str:
    if length < PLOT_CONFIG['thresholds']['small']:
        return PLOT_CONFIG['colors']['small']
    elif length < PLOT_CONFIG['thresholds']['medium']:
        return PLOT_CONFIG['colors']['medium']
    return PLOT_CONFIG['colors']['large']

def get_directory_path(name: str, directory_type: Literal['merged', 'subset']) -> str:
    return os.path.join(PLOT_CONFIG['base_path'], directory_type, f"{name}")

def get_file_lengths(data_dir: str) -> Dict[str, int]:
    return {
        file: len(pd.read_csv(os.path.join(data_dir, file)))
        for file in os.listdir(data_dir)
        if file.endswith('.csv') and not re.search(r'\d', file)
    }
In [44]:
def plot_len(
    name: str,
    directory_type: Literal['merged', 'subset'] = 'merged',
    sort_by: Literal['name', 'length'] = 'length',
    ascending: bool = True
) -> None:
    # Get data
    data_dir = get_directory_path(name, directory_type)
    if not os.path.exists(data_dir):
        print(f"Directory not found: {data_dir}")
        return
        
    file_lengths = get_file_lengths(data_dir)
    if not file_lengths:
        print(f"No CSV files found in {data_dir}")
        return

    # Sort data
    sorted_items = sorted(
        file_lengths.items(),
        key=lambda x: x[0] if sort_by == 'name' else x[1],
        reverse=not ascending
    )
    files, lengths = zip(*sorted_items)

    # Create and style plot
    plt.figure(figsize=(15, 8))
    bars = plt.bar(files, lengths)

    if directory_type == "merged":
        for bar in bars:
            bar.set_color(get_color_by_length(bar.get_height()))
        
        legend_elements = [
            plt.Rectangle((0,0), 1, 1, facecolor=color, label=f'{label} rows')
            for label, color in [('<20k', 'green'), ('20k-100k', 'orange'), ('>100k', 'red')]
        ]
        plt.legend(handles=legend_elements)

    plt.title(f'Number of Rows in Each CSV File - {name} Dataset ({directory_type})', pad=20)
    plt.xlabel('File Names')
    plt.ylabel('Number of Rows')
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    
    # Save and display
    os.makedirs('img', exist_ok=True)
    plt.savefig(f'img/file_lengths_{name}_{directory_type}.png', bbox_inches='tight')
    plt.show()
In [45]:
plot_len('train')
plot_len('test')
[Figure: bar chart of row counts per CSV file, train dataset (merged)]
[Figure: bar chart of row counts per CSV file, test dataset (merged)]

Subset¶

Create Directory¶

mkdir dataset/subset && cd dataset/subset
mkdir train && mkdir test

Subset Generation¶

This code implements our sampling strategy through random sampling, using a fixed random seed for reproducibility.

For each file, it applies the rules described above. The stratified approach maintains data representativeness while reducing volume, with random sampling ensuring unbiased selection within each size range.

In [46]:
SUB_CONFIG = {
    'RANDOM_SEED': 37,
    'CAP': 20000,
    'FIRST': 0.2,
    'SECOND': 0.1,
    'DIRS': {
        'original': 'dataset/merged',
        'subset': 'dataset/subset'
    }
}

def calculate_subset(df: pd.DataFrame) -> pd.DataFrame:
    """Create subset based on dataframe size."""
    total_rows = len(df)
    
    if total_rows < SUB_CONFIG['CAP']:
        return df
        
    # Get first 20k random rows
    first = df.sample(n=SUB_CONFIG['CAP'], random_state=SUB_CONFIG['RANDOM_SEED'])
    
    if total_rows < 100000:
        # For files between 20k and 100k
        remaining = df.drop(first.index)
        additional = remaining.sample(frac=SUB_CONFIG['FIRST'], 
                                   random_state=SUB_CONFIG['RANDOM_SEED'])
        return pd.concat([first, additional])
    
    # For files over 100k
    next_80k = df.drop(first.index).sample(
        n=(100000-SUB_CONFIG['CAP']), 
        random_state=SUB_CONFIG['RANDOM_SEED']
    )
    twenty_percent = next_80k.sample(
        frac=SUB_CONFIG['FIRST'], 
        random_state=SUB_CONFIG['RANDOM_SEED']
    )
    
    remaining = df.drop(first.index).drop(next_80k.index)
    ten_percent = remaining.sample(
        frac=SUB_CONFIG['SECOND'], 
        random_state=SUB_CONFIG['RANDOM_SEED']
    )
    
    return pd.concat([first, twenty_percent, ten_percent])
In [49]:
def process_file(file_path: str, output_path: str) -> tuple:
    """Process single file and return statistics."""
    df = pd.read_csv(file_path)
    original_size = len(df)
    
    if 'Benign' in file_path:
        df_subset = df.copy()
    else:
        df_subset = calculate_subset(df)
    
    df_subset.to_csv(output_path, index=False)
    
    return original_size, len(df_subset)

def create_subset(name: str) -> None:
    """Create subset for given dataset."""
    np.random.seed(SUB_CONFIG['RANDOM_SEED'])
    
    # Setup directories
    og_dir = os.path.join(SUB_CONFIG['DIRS']['original'], f"{name}")
    sub_dir = os.path.join(SUB_CONFIG['DIRS']['subset'], name)
    os.makedirs(sub_dir, exist_ok=True)
    
    print(f"Processing directory: {og_dir}")
    
    total_orig_size = 0
    total_sub_size = 0
    
    # Process files
    for file in os.listdir(og_dir):
        if file.endswith('.csv') and not re.search(r'\d', file):
            orig_size, sub_size = process_file(
                os.path.join(og_dir, file),
                os.path.join(sub_dir, file)
            )
            
            total_orig_size += orig_size
            total_sub_size += sub_size
            
    
    overall_reduction = ((total_orig_size - total_sub_size) / total_orig_size * 100)
    print(f"Overall reduction: {total_orig_size} -> {total_sub_size} ({overall_reduction:.2f}% reduction)")
In [50]:
create_subset('train')
create_subset('test')
Processing directory: dataset/merged/train
Overall reduction: 7160831 -> 1235465 (82.75% reduction)
Processing directory: dataset/merged/test
Overall reduction: 1614182 -> 465367 (71.17% reduction)
In [51]:
def plot_comparison(name: str) -> None:
    # Get data for both directories
    original_dir = get_directory_path(name, 'merged')
    subset_dir = get_directory_path(name, 'subset')
    
    if not all(os.path.exists(d) for d in [original_dir, subset_dir]):
        print(f"One or both directories not found: {original_dir}, {subset_dir}")
        return
        
    # Get lengths for both datasets
    original_lengths = get_file_lengths(original_dir)
    subset_lengths = get_file_lengths(subset_dir)
    
    if not all([original_lengths, subset_lengths]):
        print("No CSV files found in one or both directories")
        return
        
    # Sort data by length
    orig_items = sorted(original_lengths.items(), key=lambda x: x[1])
    subset_items = sorted(subset_lengths.items(), key=lambda x: x[1])
    
    orig_files, orig_lengths = zip(*orig_items)
    subset_files, subset_lengths = zip(*subset_items)
    
    # Create figure with adjusted size (less height needed without labels)
    fig, ax1 = plt.subplots(figsize=(20, 8))
    ax2 = ax1.twinx()
    
    # Plot original dataset first (background)
    bars1 = ax1.bar(range(len(orig_files)), orig_lengths, 
                    label='Original Dataset', alpha=0.7,
                    color=[get_color_by_length(l) for l in orig_lengths],
                    zorder=1)
    
    # Plot subset dataset second (foreground)
    bars2 = ax2.bar(range(len(subset_files)), subset_lengths, 
                    label='Processed Dataset', alpha=0.5, edgecolor='black',
                    color=[get_color_by_length(l) for l in subset_lengths],
                    zorder=2)
    
    # Customize axes
    ax1.set_ylabel('Number of Rows (Original)', color='darkblue', fontsize=12)
    ax2.set_ylabel('Number of Rows (Processed)', color='darkred', fontsize=12)
    
    # Remove x-axis labels completely
    ax1.set_xticks([])  # Remove x-ticks
    ax1.set_xticklabels([])  # Remove x-tick labels
    
    # Add legends
    color_legend = [
        plt.Rectangle((0,0), 1, 1, facecolor=color, alpha=0.7, label=f'{label} rows')
        for label, color in [('<20k', 'green'), ('20k-100k', 'orange'), ('>100k', 'red')]
    ]
    dataset_legend = [
        plt.Rectangle((0,0), 1, 1, facecolor='gray', alpha=0.7, label='Original'),
        plt.Rectangle((0,0), 1, 1, facecolor='gray', alpha=0.5, label='Processed')
    ]
    
    # Position legend
    ax1.legend(handles=color_legend + dataset_legend, 
              loc='upper left')
    
    # Add title
    plt.title(f'Comparison of Row Counts - {name} Dataset', 
             pad=20, 
             fontsize=14)
    
    # Adjust layout
    plt.tight_layout()
    
    # Save
    plt.savefig(f'img/file_lengths_comparison_{name}.png', 
                bbox_inches='tight',
                dpi=300)
    plt.show()
In [52]:
plot_comparison("train")
plot_comparison("test")
[Figure: overlaid bar charts comparing original vs. processed row counts, train dataset]
[Figure: overlaid bar charts comparing original vs. processed row counts, test dataset]

Data Merging and Additional Feature Creation¶

After creating the subsets, we merged the files into consolidated train and test datasets, adding three categorical features with increasing levels of specificity about the type of traffic:

  • is_benign: whether the traffic is benign (1) or malicious (0)
  • category: attack category (DDoS, DoS, etc.)
  • attack: specific attack type

The mapping between file names and attack types was automated using regex pattern matching. This consolidation maintains attack classification information while simplifying subsequent analysis.

In [53]:
ATTACK_MAPPING: Dict[str, Tuple[int, str, str]] = {
    r'^Benign':                     (1, 'BENIGN',   'Benign'),
    r'^ARP_Spoofing':               (0, 'SPOOFING', 'ARP Spoofing'),
    r'^Recon-Ping_Sweep':           (0, 'RECON',    'Ping Sweep'),
    r'^Recon-VulScan':              (0, 'RECON',    'Recon VulScan'),
    r'^Recon-OS_Scan':              (0, 'RECON',    'OS Scan'),
    r'^Recon-Port_Scan':            (0, 'RECON',    'Port Scan'),
    r'^MQTT-Malformed_Data':        (0, 'MQTT',     'Malformed Data'),
    r'^MQTT-DoS-Connect_Flood':     (0, 'MQTT',     'DoS Connect Flood'),
    r'^MQTT-DDoS-Publish_Flood':    (0, 'MQTT',     'DDoS Publish Flood'),
    r'^MQTT-DoS-Publish_Flood':     (0, 'MQTT',     'DoS Publish Flood'),
    r'^MQTT-DDoS-Connect_Flood':    (0, 'MQTT',     'DDoS Connect Flood'),
    r'TCP_IP-DoS-TCP':              (0, 'DoS',      'DoS TCP'),
    r'TCP_IP-DoS-ICMP':             (0, 'DoS',      'DoS ICMP'),
    r'TCP_IP-DoS-SYN':              (0, 'DoS',      'DoS SYN'),
    r'TCP_IP-DoS-UDP':              (0, 'DoS',      'DoS UDP'),
    r'TCP_IP-DDoS-SYN':             (0, 'DDoS',     'DDoS SYN'),
    r'TCP_IP-DDoS-TCP':             (0, 'DDoS',     'DDoS TCP'),
    r'TCP_IP-DDoS-ICMP':            (0, 'DDoS',     'DDoS ICMP'),
    r'TCP_IP-DDoS-UDP':             (0, 'DDoS',     'DDoS UDP')
}

def get_category_and_attack(filename: str) -> Tuple[int, str, str]:
    """Get category and attack type from filename."""
    for pattern, (is_benign, category, attack) in ATTACK_MAPPING.items():
        if re.match(pattern, filename):
            return is_benign, category, attack
    return -1, 'UNKNOWN', 'UNKNOWN'
In [54]:
def process_files_in_chunks(directory: str, output_file: str, chunk_size: int = 100000) -> None:
    """Process CSV files in chunks and write directly to output file."""
    files = [f for f in os.listdir(directory) if f.endswith('.csv')]
    print(f"Found {len(files)} CSV files in {directory}")
    
    # Write header first
    first_file = True
    
    for filename in files:
        filepath = os.path.join(directory, filename)
        is_benign, category, attack = get_category_and_attack(filename)
        
        # Process the file in chunks
        for chunk in pd.read_csv(filepath, chunksize=chunk_size):
            chunk['is_benign'] = is_benign
            chunk['category'] = category
            chunk['attack'] = attack
            
            # Write to output file
            chunk.to_csv(output_file, 
                        mode='w' if first_file else 'a',
                        header=first_file,
                        index=False)
            first_file = False
In [55]:
base_dir = 'dataset/subset'
for dataset in ['train', 'test']:
    output_file = f'dataset/{dataset}_labeled.csv'
    
    # Process files
    process_files_in_chunks(
        directory=os.path.join(base_dir, dataset),
        output_file=output_file,
        chunk_size=100000
    )
Found 19 CSV files in dataset/subset/train
Found 19 CSV files in dataset/subset/test
In [56]:
train_df = pd.read_csv('dataset/train_labeled.csv')
test_df = pd.read_csv('dataset/test_labeled.csv')

# Get counts
train_categories = train_df['category'].value_counts()
test_categories = test_df['category'].value_counts()
train_attacks = train_df['attack'].value_counts()
test_attacks = test_df['attack'].value_counts()

# Create comparative tables
def create_comparison_table(train_counts, test_counts):
    """Create a comparison table with train and test counts."""
    # Get all unique indices
    all_indices = sorted(set(train_counts.index) | set(test_counts.index))
    
    # Create table rows
    table = [
        [
            index,
            f"{train_counts.get(index, 0):,}",
            f"{test_counts.get(index, 0):,}"
        ]
        for index in all_indices
    ]
    
    # Sort by total count (train + test)
    table.sort(key=lambda x: (
        float(x[1].replace(',', '')) + float(x[2].replace(',', ''))
    ), reverse=True)
    
    return table

# Print total rows
print(f"\nTotal Rows:")
print(f"Train: {len(train_df):,}")
print(f"Test:  {len(test_df):,}")

# Print category distribution
print("\nCategory Distribution:")
category_table = create_comparison_table(train_categories, test_categories)
print(tabulate(
    category_table,
    headers=['Category', 'Train Count', 'Test Count'],
    tablefmt='psql'
))

# Print attack distribution
print("\nAttack Type Distribution:")
attack_table = create_comparison_table(train_attacks, test_attacks)
print(tabulate(
    attack_table,
    headers=['Attack Type', 'Train Count', 'Test Count'],
    tablefmt='psql'
))
Total Rows:
Train: 1,235,465
Test:  465,367
Category Distribution:
+------------+---------------+--------------+
| Category   | Train Count   | Test Count   |
|------------+---------------+--------------|
| DDoS       | 581,986       | 210,677      |
| DoS        | 284,552       | 143,579      |
| BENIGN     | 192,732       | 37,607       |
| MQTT       | 107,607       | 46,182       |
| RECON      | 52,541        | 25,578       |
| SPOOFING   | 16,047        | 1,744        |
+------------+---------------+--------------+
Attack Type Distribution:
+--------------------+---------------+--------------+
| Attack Type        | Train Count   | Test Count   |
|--------------------+---------------+--------------|
| DDoS UDP           | 189,596       | 62,207       |
| DDoS ICMP          | 179,748       | 60,970       |
| Benign             | 192,732       | 37,607       |
| DDoS TCP           | 106,446       | 44,260       |
| DDoS SYN           | 106,196       | 43,240       |
| DoS UDP            | 82,695        | 39,755       |
| DoS SYN            | 70,190        | 35,719       |
| DoS ICMP           | 67,629        | 35,686       |
| DoS TCP            | 64,038        | 32,419       |
| DDoS Connect Flood | 43,304        | 24,383       |
| Port Scan          | 32,796        | 20,524       |
| DoS Publish Flood  | 24,875        | 8,505        |
| DDoS Publish Flood | 21,525        | 8,416        |
| OS Scan            | 16,832        | 3,834        |
| ARP Spoofing       | 16,047        | 1,744        |
| DoS Connect Flood  | 12,773        | 3,131        |
| Malformed Data     | 5,130         | 1,747        |
| Recon VulScan      | 2,173         | 1,034        |
| Ping Sweep         | 740           | 186          |
+--------------------+---------------+--------------+

2. ANALYSING AND SELECTING FEATURES¶

In [9]:
# Load the merged dataset
df = pd.read_csv('dataset/train_labeled.csv')

Checking for NULL values¶

In [ ]:
# Count the number of features with null values
null_counts = df.isnull().sum()
features_with_nulls = null_counts[null_counts > 0]

# Create a DataFrame for display
null_features_df = pd.DataFrame({'Feature': features_with_nulls.index, 'Null Count': features_with_nulls.values})

# Print the table
print(tabulate(null_features_df, headers='keys', tablefmt='psql'))
+-----------+--------------+
| Feature   | Null Count   |
|-----------+--------------|
+-----------+--------------+

Good news: there are no features with NULL values.

Checking features type¶

In [ ]:
# Print type and count of features
type_counts = df.dtypes.value_counts()
print(tabulate(pd.DataFrame({'Type': type_counts.index, 'Count': type_counts.values}), headers='keys', tablefmt='psql'))
+----+---------+---------+
|    | Type    |   Count |
|----+---------+---------|
|  0 | float64 |      45 |
|  1 | object  |       2 |
|  2 | int64   |       1 |
+----+---------+---------+

We can observe that all 45 original features are float64; the only int64 and object columns are the three labels we created.

Statistical Analysis¶

The following code performs statistical analysis on the merged dataset to uncover feature relationships and characteristics. Through correlation matrices, variance metrics, and zero/missing value analysis, it reveals redundant features and quality issues that inform feature selection and preprocessing decisions.

In [10]:
df_numeric = df.select_dtypes(include=['float64', 'int64'])

# Correlation Analysis
corr = df_numeric.corr(method = 'pearson')

# Create visualizations
plt.figure(figsize=(15, 10))
# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))
# Draw the heatmap with the mask and correct aspect ratio
cmap = sns.diverging_palette(12, 150, s=100, as_cmap=True)

sns.heatmap(corr, mask=mask, cmap=cmap, center=0,
        square=True, linewidths=.5, cbar_kws={"shrink": .5},
        xticklabels=True, yticklabels=True)
plt.title('Feature Correlation Matrix ')
plt.tight_layout()
plt.savefig('img/correlation.png', bbox_inches='tight')
plt.show()
[Figure: feature correlation matrix heatmap]
In [11]:
# Print correlations
print("\n--- Features with >90% correlation ---")
corr_matrix = corr
high_corr = np.where(np.abs(corr_matrix) >= 0.9)
high_corr_data = []
for i, j in zip(*high_corr):
    if i < j:  # avoid duplicate pairs
        high_corr_data.append([
            corr_matrix.index[i], 
            corr_matrix.columns[j], 
            corr_matrix.iloc[i, j]
        ])
print(tabulate(high_corr_data, 
            headers=["Feature 1", "Feature 2", "Correlation"], 
            tablefmt="grid",
            floatfmt=".5f"))
--- Features with >90% correlation ---
+---------------+-------------+---------------+
| Feature 1     | Feature 2   |   Correlation |
+===============+=============+===============+
| Protocol Type | UDP         |       0.93744 |
+---------------+-------------+---------------+
| Rate          | Srate       |       1.00000 |
+---------------+-------------+---------------+
| ARP           | IPv         |      -1.00000 |
+---------------+-------------+---------------+
| ARP           | LLC         |      -1.00000 |
+---------------+-------------+---------------+
| IPv           | LLC         |       1.00000 |
+---------------+-------------+---------------+
| Tot sum       | AVG         |       0.90472 |
+---------------+-------------+---------------+
| Max           | Magnitue    |       0.93071 |
+---------------+-------------+---------------+
| AVG           | Tot size    |       0.97746 |
+---------------+-------------+---------------+
| AVG           | Magnitue    |       0.97982 |
+---------------+-------------+---------------+
| Std           | Radius      |       0.99997 |
+---------------+-------------+---------------+
| Std           | Covariance  |       0.95656 |
+---------------+-------------+---------------+
| Tot size      | Magnitue    |       0.95790 |
+---------------+-------------+---------------+
| IAT           | Number      |       0.99914 |
+---------------+-------------+---------------+
| IAT           | Weight      |       0.99933 |
+---------------+-------------+---------------+
| Number        | Weight      |       0.99995 |
+---------------+-------------+---------------+
| Radius        | Covariance  |       0.95657 |
+---------------+-------------+---------------+
In [12]:
# Feature Variance
variance = df_numeric.var().sort_values(ascending=False)

# Print variance
print("\n--- Top 10 High Variance Features ---")
variance_data = [[feature, value] for feature, value in 
                variance.head(10).items()]
print(tabulate(variance_data, 
            headers=["Feature", "Variance"], 
            tablefmt="grid",
            floatfmt=".5f"))
--- Top 10 High Variance Features ---
+---------------+------------------------+
| Feature       |               Variance |
+===============+========================+
| IAT           | 1549362499752095.75000 |
+---------------+------------------------+
| Header_Length |     405017979418.82312 |
+---------------+------------------------+
| Covariance    |       1661236197.27687 |
+---------------+------------------------+
| Srate         |       1197455513.77209 |
+---------------+------------------------+
| Rate          |       1197455513.77209 |
+---------------+------------------------+
| Tot sum       |          4509439.03181 |
+---------------+------------------------+
| rst_count     |          1378529.44909 |
+---------------+------------------------+
| Max           |            74750.58662 |
+---------------+------------------------+
| AVG           |            34443.39249 |
+---------------+------------------------+
| Tot size      |            34366.50876 |
+---------------+------------------------+
In [15]:
# Zero Values Percentage
zero_percent = (df_numeric == 0).sum() / len(df_numeric) * 100

# Print features with high zero percentage
print("\n--- Top 10 Zero Value Percentages ---")
zero_data = [[feature, value] for feature, value in 
            zero_percent.sort_values(ascending=False).head(10).items()]
print(tabulate(zero_data, 
            headers=["Feature", "Zero %"], 
            tablefmt="grid",
            floatfmt=".5f"))
--- Top 10 Zero Value Percentages ---
+-----------------+-----------+
| Feature         |    Zero % |
+=================+===========+
| Drate           | 100.00000 |
+-----------------+-----------+
| DHCP            |  99.99409 |
+-----------------+-----------+
| IGMP            |  99.98867 |
+-----------------+-----------+
| cwr_flag_number |  99.98810 |
+-----------------+-----------+
| ece_flag_number |  99.98462 |
+-----------------+-----------+
| IRC             |  99.93128 |
+-----------------+-----------+
| SMTP            |  99.93055 |
+-----------------+-----------+
| Telnet          |  99.92990 |
+-----------------+-----------+
| SSH             |  99.92788 |
+-----------------+-----------+
| HTTP            |  99.46749 |
+-----------------+-----------+

Selection Method 1: Manual Selection¶

We can see, both from the correlation matrix and from the zero-value percentages, that the Drate column contains only zeros, so we dropped it.

In [162]:
df=df.drop('Drate', axis=1)

This was the only trivial choice. However, further selection is needed: many features are highly correlated, and this redundancy can negatively affect the quality of our models. For this reason, we want to keep only non-redundant features, to maximise the discriminatory power of the variables.
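For intuition, this kind of redundancy pruning can also be automated. Below is a hedged sketch (on toy data, not the selection we actually applied in this notebook) that greedily drops one feature from every pair whose absolute Pearson correlation meets a threshold:

```python
import numpy as np
import pandas as pd

def drop_correlated(df: pd.DataFrame, threshold: float = 0.9) -> list:
    """Return the columns to drop so that no remaining pair of
    numeric features has |Pearson correlation| >= threshold."""
    corr = df.corr(method='pearson').abs()
    # Look only at the upper triangle (k=1 excludes the diagonal)
    # to avoid inspecting each pair twice
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] >= threshold).any()]

# Toy example: 'b' is a scaled copy of 'a' (correlation 1.0), so it is dropped
toy = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [2, 4, 6, 8], 'c': [4, 1, 3, 2]})
print(drop_correlated(toy))  # ['b']
```

Which member of a correlated pair survives is arbitrary here; the manual and cluster-based selections below make that choice deliberately instead.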

We based our first approach on direct inspection of the output of the previous analysis. Combining the correlation and variance information, we decided to drop the following columns:

In [163]:
to_drop = ["Srate", "Protocol Type", "IPv", "LLC", "Tot sum", "Tot size", "AVG", "Max", "Number", "Weight"]
df_sel_manual = df.drop(columns=to_drop)

As a starting point, in order to evaluate our dataset and choices, we set up a first classification model: we want to predict whether a packet flow is benign or malicious using a Random Forest. Since we have an abundance of data, we split the training dataset into Train and Validation sets, to avoid overfitting the test dataset during this feature selection phase.

In [164]:
# Splitting the Dataset in Training and Validation sets
X_train, X_val, y_train, y_val = train_test_split(
    df_sel_manual.drop(['is_benign', 'category', 'attack'], axis=1),
    df_sel_manual['is_benign'], train_size=0.7, random_state=37
)
In [165]:
# Training the RandomForest
rnd_forest = RandomForestClassifier(random_state=37)
rnd_forest.fit(X_train, y_train);
In [166]:
dump(rnd_forest, 'models/random_forest_sel_man.joblib');

A useful metric that the Random Forest ensemble provides is the feature importance (mean decrease in Gini impurity). This can be used to further refine our feature selection.

In [167]:
feature_importances_dict = dict(zip(X_train.columns, rnd_forest.feature_importances_))

# Sort features by importance (descending order)
sorted_dict = dict(sorted(feature_importances_dict.items(), key=lambda x: x[1], reverse=True))

# Create plot
plt.figure(figsize=(10, 6))
sns.barplot(
    x=list(sorted_dict.values()),
    y=list(sorted_dict.keys()),
    hue=list(sorted_dict.keys()),  # mirror y as hue so the palette applies without a warning
    palette="Blues_r"
)
plt.xlabel("Importance", fontsize=12)
plt.ylabel("Features", fontsize=12)
plt.title("Feature Importances", fontsize=14, fontweight="bold")
plt.grid(axis='x', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
[Figure: feature importance bar plot]

Evaluating the performance on the validation set, using common metrics:

In [168]:
def compute_metrics(y_true, y_pred):
    """
    Print comprehensive model evaluation metrics with both rich text output and seaborn heatmap.
    
    Parameters:
    -----------
    y_true : array-like
        True labels
    y_pred : array-like
        Predicted labels
    """
    
    # Calculate core metrics
    metrics = {
        "Accuracy": accuracy_score(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred),
        "Recall": recall_score(y_true, y_pred),
        "F1 Score": f1_score(y_true, y_pred)
    }
    
    # Create metrics table
    table_data = [[metric, f"{value:.5f}"] for metric, value in metrics.items()]
    table = tabulate(table_data, headers=["Metric", "Score"], tablefmt="psql")
    
    print(table)
    
    return metrics
In [169]:
y_pred = rnd_forest.predict(X_val)
with open('models/random_forest_sel_man.json', 'w') as f:
    json.dump(compute_metrics(y_val, y_pred), f);
+-----------+---------+
| Metric    |   Score |
|-----------+---------|
| Accuracy  | 0.99667 |
| Precision | 0.98413 |
| Recall    | 0.99477 |
| F1 Score  | 0.98942 |
+-----------+---------+

Selection Method 2: Hierarchical Clustering Selection¶

The results are good, but let's now try selecting the features with a more robust and reliable approach. This time, beyond the correlation calculation (the same as before), we apply hierarchical clustering to group the most correlated features together. We selected Euclidean distance and complete linkage.
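The linkage that seaborn's clustermap computes internally can be reproduced explicitly with SciPy. A minimal sketch on a toy 4-feature correlation matrix (hypothetical values, chosen so that features 0-1 and 2-3 pair up):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Toy correlation matrix: features 0-1 strongly correlated, 2-3 strongly correlated
corr = np.array([
    [1.00, 0.95, 0.10, 0.00],
    [0.95, 1.00, 0.00, 0.10],
    [0.10, 0.00, 1.00, 0.90],
    [0.00, 0.10, 0.90, 1.00],
])

# Euclidean distances between the rows of the correlation matrix,
# then complete-linkage agglomeration
Z = linkage(pdist(corr, metric='euclidean'), method='complete')

# Cut the dendrogram at distance 1.0: features 0,1 land in one cluster, 2,3 in another
labels = fcluster(Z, t=1.0, criterion='distance')
print(labels)
```

Complete linkage merges two clusters based on the *farthest* pair of members, which keeps loosely related features from being absorbed into a tight cluster.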

In [170]:
# Parameters
corr_method = 'pearson'
link_method = 'complete'
dist_metric = 'euclidean'

# Compute correlation matrix, excluding categorical columns
corr = df.drop(['is_benign', 'attack', 'category'], axis=1).corr(method=corr_method)
cmap = sns.diverging_palette(12, 150, s=100, as_cmap=True)

# Create initial clustermap to get linkage matrix
g = sns.clustermap(
    corr,
    annot=False,
    cbar=True,
    cmap=cmap,
    center=0,
    linewidths=.5,
    cbar_kws={"shrink": .5},
    metric=dist_metric,
    method=link_method,
    xticklabels=True,
    yticklabels=True
)
g.figure.suptitle(
    f"Clustermap (ClusterMethod: {link_method}, DistanceMetric: {dist_metric}, CorrelationMethod: {corr_method})",
    y=1.05
)

# Create detailed visualization with dendrogram and correlation matrix
linkage_matrix = g.dendrogram_col.linkage
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 10))

# Left plot: Dendrogram
dendrogram(
    linkage_matrix,
    ax=ax1,
    labels=corr.columns
)
ax1.set_title("Hierarchical Clustering Dendrogram")

# Right plot: Reordered correlation matrix
dendro_idx = g.dendrogram_col.reordered_ind
reordered_corr = corr.iloc[dendro_idx, dendro_idx]

sns.heatmap(
    reordered_corr,
    annot=False,
    cmap=cmap,
    cbar=True,
    ax=ax2,
    cbar_kws={"shrink": .5},
    linewidths=0.5,
    square=True,
    xticklabels=True,
    yticklabels=True
)
ax2.set_title("Correlation Matrix")

# Final layout adjustments
fig.suptitle(
    f"Clustermap (ClusterMethod: {link_method}, DistanceMetric: {dist_metric}, CorrelationMethod: {corr_method})",
    y=1.05
)
plt.tight_layout()
plt.show()
[Figures: clustermap of the correlation matrix; hierarchical clustering dendrogram with the reordered correlation matrix]

The only manual decision in this step is the distance threshold, selected by visually inspecting the dendrogram. The rest of the process is automated by the script, which keeps, for each cluster formed at the chosen threshold, the feature with the highest (Gini-based) importance.

In [171]:
threshold = 1.0 

# Group features by cluster
clusters = fcluster(linkage_matrix, t=threshold, criterion='distance')
clustered_features = {}
for idx, cluster in enumerate(clusters):
    if cluster not in clustered_features:
        clustered_features[cluster] = []
    clustered_features[cluster].append(corr.columns[idx])

# Print clusters in a clean, sorted format
print("Clustered Features:\n" + "-"*20)
# Sort by cluster number for better readability
for cluster in sorted(clustered_features.keys()):
    features = clustered_features[cluster]
    # Join features with commas and proper spacing
    feature_list = ", ".join(features)
    print(f"Cluster {cluster:2d}: {feature_list}")
Clustered Features:
--------------------
Cluster  1: Rate, Srate
Cluster  2: Protocol Type, UDP
Cluster  3: IPv, LLC
Cluster  4: ICMP
Cluster  5: IAT, Number, Weight
Cluster  6: ece_flag_number, cwr_flag_number
Cluster  7: Telnet, SMTP, IRC
Cluster  8: SSH
Cluster  9: DHCP
Cluster 10: IGMP
Cluster 11: DNS
Cluster 12: Duration
Cluster 13: HTTP
Cluster 14: syn_flag_number
Cluster 15: syn_count
Cluster 16: fin_flag_number, ack_count
Cluster 17: rst_flag_number
Cluster 18: ARP
Cluster 19: psh_flag_number, ack_flag_number, Variance
Cluster 20: rst_count
Cluster 21: TCP
Cluster 22: Std, Radius, Covariance
Cluster 23: fin_count
Cluster 24: Tot sum, Max, AVG, Tot size, Magnitue
Cluster 25: Header_Length, HTTPS
Cluster 26: Min
In [172]:
# Prepare data for Random Forest
X = df.drop(['is_benign', 'category', 'attack'], axis=1)
y = df['is_benign']
X_train, X_val, y_train, y_val = train_test_split(X, y, train_size=0.7, random_state=37)
In [173]:
# Train Random Forest and get feature importances
rnd_forest.fit(X_train, y_train)
feature_importances = dict(zip(X_train.columns, rnd_forest.feature_importances_))

# Sort features by importance
sorted_features = sorted(
    feature_importances.items(), 
    key=lambda x: x[1], 
    reverse=True
)

# Select most important feature from each cluster
selected_features = []
for cluster, features in clustered_features.items():
    # Get importance scores for features in this cluster
    cluster_importances = {
        feature: feature_importances[feature] 
        for feature in features
    }
    # Select feature with highest importance
    best_feature = max(cluster_importances, key=cluster_importances.get)
    selected_features.append(best_feature)
In [174]:
print("\nSelected features from each cluster:")
for feature in selected_features:
    print(f"  - {feature}")
Selected features from each cluster:
  - Header_Length
  - UDP
  - Duration
  - Rate
  - ack_count
  - syn_flag_number
  - rst_flag_number
  - Variance
  - ece_flag_number
  - syn_count
  - fin_count
  - rst_count
  - HTTP
  - DNS
  - SMTP
  - SSH
  - TCP
  - DHCP
  - ARP
  - ICMP
  - IGMP
  - IPv
  - AVG
  - Min
  - Covariance
  - IAT

Let's train a new RandomForest with the dataset containing the selected features:

In [175]:
selected_features.extend(['is_benign', 'attack', 'category'])
df_sel_hclust = df[selected_features]

X_train, X_val, y_train, y_val = train_test_split(
    df_sel_hclust.drop(['is_benign', 'category', 'attack'], axis=1),
    df_sel_hclust['is_benign'], train_size=0.7, random_state=37
)
rnd_forest.fit(X_train, y_train);
In [ ]:
dump(rnd_forest, 'models/random_forest_sel_hclust.joblib');
In [176]:
feature_importances_dict = dict(zip(X_train.columns, rnd_forest.feature_importances_))

# Sort features by importance (descending order)
sorted_dict = dict(sorted(feature_importances_dict.items(), key=lambda x: x[1], reverse=True))

# Create plot
plt.figure(figsize=(10, 6))
sns.barplot(
    x=list(sorted_dict.values()),
    y=list(sorted_dict.keys()),
    hue=list(sorted_dict.keys()),
    palette="Blues_r"
)
plt.xlabel("Importance", fontsize=12)
plt.ylabel("Features", fontsize=12)
plt.title("Feature Importances", fontsize=14, fontweight="bold")
plt.grid(axis='x', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
[Figure: feature importance bar plot for the cluster-selected features]

Let's test the performance of the new approach used for feature selection:

In [177]:
y_pred = rnd_forest.predict(X_val)
with open('models/random_forest_sel_hclust.json', 'w') as f:
    json.dump(compute_metrics(y_val, y_pred), f);
+-----------+---------+
| Metric    |   Score |
|-----------+---------|
| Accuracy  | 0.99661 |
| Precision | 0.98381 |
| Recall    | 0.99472 |
| F1 Score  | 0.98924 |
+-----------+---------+
In [178]:
# Load the performance metrics
with open('models/random_forest_sel_hclust.json', 'r') as f:
    metrics_sel_hclust = json.load(f)

# Load the performance metrics
with open('models/random_forest_sel_man.json', 'r') as f:
    metrics_sel_manual = json.load(f)

# Metric names
metric_names = ["Accuracy", "Precision", "Recall", "F1 Score"]

# Combine into a table
table = zip(metric_names, [metrics_sel_hclust[m] for m in metric_names], [metrics_sel_manual[m] for m in metric_names])

# Print the table
print(tabulate(table, headers=["Metric", "HClust Selection", "Manual Selection"], floatfmt=".3f"))
Metric       HClust Selection    Manual Selection
---------  ------------------  ------------------
Accuracy                0.997               0.997
Precision               0.984               0.984
Recall                  0.995               0.995
F1 Score                0.989               0.989

These results validate both our methods and provide an indication of the strong performance of the Random Forest model.

Let's save the new dataset with the selected features:

In [179]:
df_sel_hclust.to_csv('dataset/train_sel_hclust.csv', index=False)

3. FEATURE VISUALISATION¶

Let's compare the distribution of the attacks with the benign data flows, using the top 5 most important features:

In [25]:
df_sel = pd.read_csv('dataset/train_sel_hclust.csv')
In [26]:
def plot_binary_distributions(df, features):
    # Apply Min-Max scaling
    scaler = MinMaxScaler()
    df[features] = scaler.fit_transform(df[features])

    df_copy = df.sample(frac=0.1, random_state=37)

    plt.figure(figsize=(20, 10))
    for i, feature in enumerate(features):
        plt.subplot(2, (len(features) + 1) // 2, i + 1)
        sns.kdeplot(data=df_copy, x=feature, hue='is_benign', 
                    hue_order=[0, 1], 
                    fill=True, common_norm=False,
                    palette={0: 'red', 1: 'green'})  # Fixed colors for benign and malicious
        plt.title(f'Benign vs Malicious - {feature}')
        plt.xlabel(feature)
        plt.legend(title='NATURE', labels=['benign', 'malicious'])
    plt.tight_layout()
    plt.show()

def plot_attack_distributions(df, feature):
    # Apply Min-Max scaling
    scaler = MinMaxScaler()
    df[feature] = scaler.fit_transform(df[[feature]])
    
    plt.figure(figsize=(20, 10))
    attacks = ['SPOOFING', 'RECON', 'MQTT', 'DoS', 'DDoS']
    for attack in attacks:
        data = df[df['category'].isin([attack, 'BENIGN'])]
        plt.subplot(2, 3, attacks.index(attack) + 1)
        sns.kdeplot(data=data, x=feature, hue='category', 
                   hue_order=['BENIGN', attack], 
                   fill=True, common_norm=False, 
                   palette={attack: 'red', 'BENIGN': 'green'})
        plt.title(f'BENIGN vs {attack}')
        plt.xlabel(feature)
    plt.tight_layout()
    plt.show()
In [27]:
feature_importances_dict = dict(
    zip(df_sel.drop(['is_benign', 'category', 'attack'], axis=1)
        , rnd_forest.feature_importances_))

# Sort features by importance and get top 5 feature names
top_5_features = [feature[0] for feature in sorted(
    feature_importances_dict.items(), 
    key=lambda x: x[1], 
    reverse=True
)[:5]]

Benign vs Malicious¶

In [28]:
plot_binary_distributions(df_sel.copy(), top_5_features)
[Figure: benign vs malicious KDE distributions for the top 5 features]

Benign vs Attack Category¶

In [29]:
for feature_name in top_5_features:
    plot_attack_distributions(df_sel.copy(), feature_name)
[Figures: benign vs attack-category KDE distributions, one set per top-5 feature]

rst_count (count of RST flag occurrences in packets): the graphs show that its values are concentrated near zero for attacks, regardless of the specific attack category, while its distribution is wider and shifted towards higher values for benign data. It can therefore be a useful indicator for distinguishing the two classes.

Header_Length (mean header length of the transport layer): except for spoofing, which shows a wider distribution of values, Header_Length can discriminate between the two phenomena in the other cases; the separation is less noticeable for MQTT.

Variance (variance of the packet lengths in the window): the distinction between the two class distributions is most remarkable for RECON, DoS and DDoS, since benign traffic exhibits higher values than attack-related data.

IAT (mean interval between the current and the previous packet in the window): on the one hand, IAT does not appear to be a strong indicator for differentiating attacks like spoofing and recon, but it certainly is for MQTT, DoS and DDoS, where the separation between malicious and benign traffic is very clear. In general, the values of this feature are concentrated around 0, 0.5 and 1.

Rate: the distributions for SPOOFING and RECON are almost inverted compared with the other attack categories.

DoS and DDoS: it is not surprising that these two attack types have similar distributions across all features.

4. SUPERVISED LEARNING¶

4.1 BINARY CLASSIFICATION¶

Data preprocessing¶

In order to have a balanced dataset for binary classification, we kept all benign flows and randomly dropped malicious samples until both classes contained the same amount of data.

In [2]:
df = pd.read_csv('dataset/train_sel_hclust.csv')
In [3]:
def balance_dataset(df: pd.DataFrame, target_column: str = 'is_benign') -> pd.DataFrame:
    # Separate the dataset into two groups
    df_ones = df[df[target_column] == 1]
    df_zeros = df[df[target_column] == 0]
    
    # Determine the number of samples to keep from the zeros group
    num_ones = len(df_ones)
    df_zeros_balanced = df_zeros.sample(n=num_ones, random_state=37)
    
    # Concatenate the balanced zeros with the ones
    df_balanced = pd.concat([df_ones, df_zeros_balanced], ignore_index=True)
    
    # Shuffle the dataset
    df_balanced = df_balanced.sample(frac=1, random_state=37).reset_index(drop=True)
    df_balanced.to_csv('dataset/train_binary.csv', index=False)
    
    return df_balanced
In [ ]:
# Balance the dataset
df = balance_dataset(df)

# Display the distribution of the target column
print(df['is_benign'].value_counts())
is_benign
1    192732
0    192732
Name: count, dtype: int64

Train-Validation split:

In [47]:
# Separate features and target
X = df.drop(['is_benign', 'category', 'attack'], axis=1)
y = df['is_benign']

# Split the data
X_train, X_val, y_train, y_val = train_test_split(X, y, train_size=0.7, random_state=37)

We defined the following function to compute the chosen metrics for each model and to plot the confusion matrix. We used multiple metrics to assess how effectively each model detects and classifies cyber threats:

  • Accuracy measures the overall correctness of the model, i.e. the proportion of correctly classified instances.
  • Precision measures the model's reliability in identifying true positives without raising excessive false alarms.
  • Recall (also called sensitivity or true positive rate) measures the model's ability to detect all relevant instances, especially critical in cybersecurity to avoid overlooking threats.
  • F1-score balances precision and recall via their harmonic mean.
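A minimal worked example on toy labels (not drawn from our dataset) shows how the four metrics diverge when a model misses a single instance:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Toy labels for six flows; the positive class (1) is the one being detected
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0]  # one positive flow was missed (a false negative)

print(f"Accuracy : {accuracy_score(y_true, y_pred):.3f}")   # 5/6 correct -> 0.833
print(f"Precision: {precision_score(y_true, y_pred):.3f}")  # 2 TP / (2 TP + 0 FP) -> 1.000
print(f"Recall   : {recall_score(y_true, y_pred):.3f}")     # 2 TP / (2 TP + 1 FN) -> 0.667
print(f"F1 Score : {f1_score(y_true, y_pred):.3f}")         # harmonic mean -> 0.800
```

Note how precision stays perfect while recall drops: exactly the failure mode (an overlooked threat) that recall is meant to expose.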
In [34]:
def evaluate_model(y_true, y_pred, model_name="Model", class_names=None):
    """
    Print comprehensive model evaluation metrics with both rich text output and seaborn heatmap.
    
    Parameters:
    -----------
    y_true : array-like
        True labels
    y_pred : array-like
        Predicted labels
    model_name : str, optional
        Name of the model for display purposes
    class_names : list, optional
        List of class names for axis labels
    """
    
    # Calculate core metrics
    metrics = {
        "Accuracy": accuracy_score(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred),
        "Recall": recall_score(y_true, y_pred),
        "F1 Score": f1_score(y_true, y_pred)
    }
    
    # Create metrics table
    table_data = [[metric, f"{value:.5f}"] for metric, value in metrics.items()]
    table = tabulate(table_data, headers=["Metric", "Score"], tablefmt="psql")
    
    print(table)
    
    # Calculate and plot confusion matrix as heatmap
    cm = confusion_matrix(y_true, y_pred)
    cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    
    # Create heatmap with enhanced annotations
    plt.figure(figsize=(18, 10))
    if class_names is None:
        class_names = [f"Class {i}" for i in range(cm.shape[0])]
    
    sns.heatmap(
        cm_normalized,
        annot=True,
        fmt='.2f',
        cmap='Blues',
        xticklabels=class_names,
        yticklabels=class_names,
        annot_kws={'size': 16, 'weight': 'bold'}  # Increased font size and made bold
    )
    
    # Adjust title and label properties
    plt.title(f'{model_name} - Normalized Confusion Matrix', fontsize=16, pad=20)
    plt.xlabel('Predicted', fontsize=14, weight='bold')
    plt.ylabel('True', fontsize=14, weight='bold')
    
    # Adjust tick labels size
    plt.xticks(fontsize=12)
    plt.yticks(fontsize=12)
    
    plt.tight_layout()
    plt.show()
    
    return metrics

LOGISTIC REGRESSION¶

We started with a linear classification method.

Hyperparameters Optimisation

Optimising the regularization parameter C (the inverse of regularization strength) using grid search.

In [35]:
# Define LogisticRegression and parameter grid
log_reg  = LogisticRegression()

param_grid = {
    'penalty': ['l2'],
    'C': np.linspace(0.1, 4.0, 20), 
    'max_iter': [10000],
    'random_state': [37]
}

# Set up StratifiedKFold with random_state for reproducibility
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=37)

# Set up GridSearchCV
search = GridSearchCV(
    log_reg,
    param_grid,
    scoring=['accuracy', 'precision', 'recall', 'f1'],
    refit='accuracy',
    cv=cv,
    verbose=True
)

TRAINING

In [37]:
# Start the time
start_time = time.time()

# Fit the grid search
search.fit(X_train, y_train);

# Stop the timer and print the result
elapsed_time = time.time() - start_time
print(f"Elapsed time {elapsed_time:.2f} seconds")

best_model = search.best_estimator_
best_params = search.best_params_

# Print the best parameters in a table
param_table = [[param, value] for param, value in best_params.items()]
print("Best Parameters by Average Score:")
print(tabulate(param_table, headers=["Parameter", "Value"], tablefmt="psql"))
Fitting 3 folds for each of 20 candidates, totalling 60 fits
Elapsed time 365.35 seconds
Best Parameters by Average Score:
+--------------+--------------------+
| Parameter    | Value              |
|--------------+--------------------|
| C            | 3.3842105263157896 |
| max_iter     | 10000              |
| penalty      | l2                 |
| random_state | 37                 |
+--------------+--------------------+
In [ ]:
def plot_hyperparameter(results):
    # Plot the metrics against the parameter `C`
    param_values = results['param_C'].data.astype(float)
    metrics = ['mean_test_accuracy', 'mean_test_precision', 'mean_test_recall', 'mean_test_f1']
    
    # Create a figure with two subplots side by side
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
    
    # Define the zoom region (e.g., rightmost 30% of the data)
    zoom_start = int(0.7 * len(param_values))
    
    # Dictionary to store peak locations
    peak_locations = {}
    
    # Left subplot - full view
    for metric in metrics:
        values = results[metric]
        line = ax1.plot(param_values, values, label=metric, linewidth=2)
        color = line[0].get_color()
        
        # Find the peak in the zoomed region
        zoom_values = values[zoom_start:]
        zoom_params = param_values[zoom_start:]
        peak_idx = zoom_start + np.argmax(zoom_values)
        peak_value = values[peak_idx]
        peak_param = param_values[peak_idx]
        peak_locations[metric] = (peak_param, peak_value)
        
        # Add vertical line at peak location in full view
        ax1.axvline(x=peak_param, color=color, linestyle='--', alpha=0.5)
    
    ax1.set_xlabel('C (Inverse Regularization Strength)')
    ax1.set_ylabel('Score')
    ax1.set_title('Full Range View')
    ax1.legend(loc='upper left')
    ax1.grid(True)
    
    # Right subplot - zoomed view
    for metric in metrics:
        values = results[metric]
        line = ax2.plot(param_values, values, label=metric, linewidth=2)
        color = line[0].get_color()
        
        # Get the peak information
        peak_param, peak_value = peak_locations[metric]
        
        # Add a marker at the peak
        ax2.plot(peak_param, peak_value, 'o', markersize=8, 
                color=color, markeredgecolor='black', markeredgewidth=2)
        
        # Add annotation for the peak value
        ax2.annotate(f'{peak_value:.3f}', 
                    xy=(peak_param, peak_value),
                    xytext=(10, 5),
                    textcoords='offset points',
                    ha='left',
                    va='bottom',
                    bbox=dict(boxstyle='round,pad=0.5', fc='white', alpha=0.8),
                    arrowprops=dict(arrowstyle='->', connectionstyle='arc3,rad=0'))
    
    ax2.set_xlabel('C (Inverse Regularization Strength)')
    ax2.set_title('Zoomed View of Peak Region')
    ax2.grid(True)
    
    # Set the x-axis limits for the zoomed view
    ax2.set_xlim(param_values[zoom_start], param_values[-1])
    
    # Adjust layout to prevent overlap
    plt.tight_layout()
    plt.show()
# Plot the hyperparameter results
plot_hyperparameter(search.cv_results_)
[Figure: metric scores vs C, full range and zoomed view of the peak region]

VALIDATION

In [ ]:
y_pred = best_model.predict(X_val)
# Evaluate the model
_ = evaluate_model(y_val, y_pred, "Logistic Regression", ["Malicious", "Benign"])  # class 0 = malicious, class 1 = benign
+-----------+---------+
| Metric    |   Score |
|-----------+---------|
| Accuracy  | 0.91639 |
| Precision | 0.92033 |
| Recall    | 0.91639 |
| F1 Score  | 0.91618 |
+-----------+---------+
[Figure: normalized confusion matrix, logistic regression]
In [39]:
dump(best_model, 'models/log_reg_binary.joblib');

RANDOM FOREST¶

After the logistic regression, we implemented an ensemble learning method based on decision trees.

Choosing the number of trees to use in the ensemble

In a Random Forest, each tree is trained on a bootstrap sample of the training data, i.e. a subset drawn with replacement. The "out-of-bag" (OOB) data is the portion of the training data that a given tree never saw during bootstrap sampling, and the OOB error, computed on those held-out rows, is a measure of the model's predictive performance.

Observing how the OOB estimate varies with the number of trees helps us understand after how many trees the model stabilizes, and thus what the optimal number of trees for our Random Forest is. The point where the OOB error stops improving or fluctuating significantly indicates that the model generalizes well to unseen data, and that adding more trees does not meaningfully improve performance.
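To make the out-of-bag idea concrete: a bootstrap sample draws n rows with replacement, so each tree leaves roughly 1/e ≈ 36.8% of the training rows unused. A quick standalone simulation (illustrative only, not tied to our dataset):

```python
import numpy as np

rng = np.random.default_rng(37)
n = 100_000

# One bootstrap sample: n draws with replacement from range(n)
boot = rng.integers(0, n, size=n)

# Rows never drawn are "out-of-bag" for this hypothetical tree
oob_fraction = 1 - len(np.unique(boot)) / n
print(f"OOB fraction: {oob_fraction:.3f}  (theory: {np.exp(-1):.3f})")
```

Every tree therefore comes with its own built-in validation set of about a third of the data, which is what makes the OOB error a cheap generalization estimate.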

In [19]:
import warnings
warnings.filterwarnings("ignore", category=UserWarning)

def calculate_oob_errors(X_train, y_train, n_trees_values, random_state=37):
    """Calculate OOB errors for a range of tree counts in RandomForest."""
    oob_errors = []
    for n_trees in n_trees_values:
        temp_forest = RandomForestClassifier(n_estimators=n_trees, oob_score=True, random_state=random_state)
        temp_forest.fit(X_train, y_train)
        oob_errors.append(1 - temp_forest.oob_score_)
    return oob_errors

# Define range of tree counts to evaluate
n_trees_values = range(1, 100, 5)

# Calculate OOB errors
oob_errors = calculate_oob_errors(X_train, y_train, n_trees_values)

# Plot the OOB error
plt.figure(figsize=(20, 12))
plt.plot(n_trees_values, oob_errors, marker='o', linestyle='-')
plt.xlabel("Number of Trees")
plt.ylabel("OOB Error")
plt.title("Out-of-Bag Error vs Number of Trees")
plt.grid(True)
plt.show()
[Figure: OOB error vs number of trees]

Optimising Hyperparameters

We used a Grid Search, with 3-folds cross validation, to choose the best function to measure the quality of a split ('criterion') and the best maximum depth of the trees in our ensemble ('max_depth')

We also defined a custom scorer function to specify what we consider the "best model": the one that maximises the average of our evaluation metrics.

In [8]:
def average_score(y_true, y_pred):
    return (accuracy_score(y_true, y_pred) + 
            precision_score(y_true, y_pred) + 
            recall_score(y_true, y_pred) + 
            f1_score(y_true, y_pred)) / 4
In [26]:
average_scorer = make_scorer(average_score)

# Define RandomForestClassifier and parameter grid
rnd_forest = RandomForestClassifier()

param_grid = {
    'criterion': ['gini', 'entropy', 'log_loss'], 
    'max_depth': [2, 5, 10, 15, 20, 25, None],
    'random_state': [37]
}

# Set up StratifiedKFold with random_state for reproducibility
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=37)

# Set up GridSearchCV
search = GridSearchCV(
    rnd_forest,
    param_grid,
    scoring=average_scorer,  # Pass the custom scorer
    refit=True,
    cv=cv,
    verbose=True,
)

TRAINING

In [28]:
start_time = time.time()

# Fit the grid search
search.fit(X_train, y_train);

elapsed_time = time.time() - start_time
print(f"Elapsed time {elapsed_time:.2f} seconds")

# Access the best model and parameters
best_model = search.best_estimator_
best_params = search.best_params_

# Print the best parameters in a table
param_table = [[param, value] for param, value in best_params.items()]
print("Best Parameters by Average Score:")
print(tabulate(param_table, headers=["Parameter", "Value"], tablefmt="psql"))
Fitting 3 folds for each of 21 candidates, totalling 63 fits
Elapsed time 972.89 seconds
Best Parameters by Average Score:
+--------------+---------+
| Parameter    | Value   |
|--------------+---------|
| criterion    | entropy |
| max_depth    |         |
| random_state | 37      |
+--------------+---------+

VALIDATION

In [ ]:
y_pred = best_model.predict(X_val)
# Evaluate the model
_ = evaluate_model(y_val, y_pred, "Random Forest", ["Benign", "Malicious"])
+-----------+---------+
| Metric    |   Score |
|-----------+---------|
| Accuracy  | 0.99707 |
| Precision | 0.99708 |
| Recall    | 0.99707 |
| F1 Score  | 0.99707 |
+-----------+---------+
In [37]:
dump(best_model, 'models/rnd_forest_binary.joblib');

XGBOOST¶

In this section, the aim is to evaluate another ensemble model based on decision trees, XGBoost, in light of the results obtained with the Random Forest.

SETTING UP THE MODEL

In [9]:
average_scorer = make_scorer(average_score)

# Initial model
xgboost = xgb.XGBClassifier(
    objective='binary:logistic',
    scale_pos_weight=len(y_train[y_train==0]) / len(y_train[y_train==1]),
    random_state=37
)

# Parameter grid
param_grid = {
    'max_depth': [3, 5, 7],
    'min_child_weight': [1, 3, 5],
    'gamma': [0, 0.1, 0.2],
    'subsample': [0.8, 0.9],
    'colsample_bytree': [0.8, 0.9]
}

# Set up StratifiedKFold with random_state for reproducibility
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=37)


# Perform grid search
search = GridSearchCV(
    xgboost,
    param_grid,
    cv=cv,
    scoring=average_scorer,
    n_jobs=-1,
    verbose=1,
    return_train_score=True
)

TRAINING

In [12]:
start_time = time.time()

# Fit the grid search
search.fit(X_train, y_train)

elapsed_time = time.time() - start_time
print(f"Elapsed time {elapsed_time:.2f} seconds")

# Access the best model and parameters
best_model = search.best_estimator_
best_params = search.best_params_

# Print the best parameters in a table
param_table = [[param, value] for param, value in best_params.items()]
print("Best Parameters by Average Score:")
print(tabulate(param_table, headers=["Parameter", "Value"], tablefmt="psql"))
Fitting 3 folds for each of 108 candidates, totalling 324 fits
Elapsed time 193.84 seconds
Best Parameters by Average Score:
+------------------+---------+
| Parameter        |   Value |
|------------------+---------|
| colsample_bytree |     0.9 |
| gamma            |     0.2 |
| max_depth        |     7   |
| min_child_weight |     1   |
| subsample        |     0.9 |
+------------------+---------+

VALIDATION

In [51]:
y_pred = best_model.predict(X_val)
# Evaluate the model
_ = evaluate_model(y_val, y_pred, "XGBoost", ["Benign", "Malicious"])
+-----------+---------+
| Metric    |   Score |
|-----------+---------|
| Accuracy  | 0.99697 |
| Precision | 0.99698 |
| Recall    | 0.99697 |
| F1 Score  | 0.99697 |
+-----------+---------+
In [19]:
dump(best_model, 'models/xgboost_binary.joblib');

MODELS COMPARISON¶

LOADING MODELS

In [9]:
log_reg = load('models/log_reg_binary.joblib')
rnd_frs = load('models/rnd_forest_binary.joblib')
xgboost = load('models/xgboost_binary.joblib')    

models = {
    'Logistic Regression': log_reg,
    'Random Forest': rnd_frs,
    'XGBoost': xgboost
}

# Load the test dataset
df = pd.read_csv('dataset/train_sel_hclust.csv')

# Separate features and target
X = df.drop(['is_benign', 'category', 'attack'], axis=1)
y = df['is_benign']

# Split the data
_, X_val, _, y_val = train_test_split(X, y, train_size=0.7, random_state=37)

CALIBRATION¶

The calibration plot shown below compares the probabilities predicted by our models with the observed frequency of positive outcomes.

In [53]:
fig, ax_calibration_curve = plt.subplots(figsize=(10, 5))

for name, clf in models.items():
    CalibrationDisplay.from_estimator(
        clf,
        X_val,
        y_val,
        name=name,
        ax=ax_calibration_curve
    )
    
ax_calibration_curve.set_xlim([0, 1])
ax_calibration_curve.set_ylim([0, 1])
ax_calibration_curve.grid()
ax_calibration_curve.set_title("Calibration plots")
plt.show()

We can see that the blue Logistic Regression curve lies closest to the perfect-calibration diagonal (where the predicted probability exactly matches the observed frequency), making it the best-calibrated model.

ROC-AUC¶

Plotting the ROC curve, and calculating the Area Under the Curve (AUC), helps us understand the sensitivity and specificity of our models by showing the true positive rate against the false positive rate (false alarms).

In [65]:
def plot_roc_curve(models, X_val, y_val):
    # Create a figure with two subplots side by side
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 10))
    
    # Plot in both subplots
    for name, model in models.items():
        y_proba = model.predict_proba(X_val)
        fpr, tpr, _ = roc_curve(y_val, y_proba[:, 1])
        auc = roc_auc_score(y_val, y_proba[:, 1])
        
        # Plot on the left subplot (full range)
        ax1.plot(fpr, tpr, label=f'{name} (AUC = {auc:.3f})', linewidth=2)
        
        # Plot on the right subplot (zoomed)
        ax2.plot(fpr, tpr, label=f'{name} (AUC = {auc:.3f})', linewidth=2)
    
    # Configure left subplot (full range)
    ax1.plot([0, 1], [0, 1], 'k:', label='Random classifier')
    ax1.set_xlim([-0.01, 1.01])
    ax1.set_ylim([-0.01, 1.01])
    ax1.grid(True, linestyle='--', alpha=0.4)
    ax1.set_title('ROC Curve (Full Range)', fontsize=14)
    ax1.set_xlabel('False Positive Rate', fontsize=12)
    ax1.set_ylabel('True Positive Rate', fontsize=12)
    ax1.legend(loc='lower right', fontsize=10)
    
    # Configure right subplot (zoomed)
    ax2.set_xlim([-0.01, 0.5])  # Focus on FPR from 0 to 0.3
    ax2.set_ylim([0.7, 1.01])   # Focus on TPR from 0.7 to 1.0
    ax2.grid(True, linestyle=':', alpha=0.4)
    ax2.set_title('ROC Curve (Zoomed)', fontsize=14)
    ax2.set_xlabel('False Positive Rate', fontsize=12)
    ax2.set_ylabel('True Positive Rate', fontsize=12)
    ax2.legend(loc='lower right', fontsize=10)
    
    # Customize tick spacing for zoomed plot
    ax2.xaxis.set_major_locator(plt.MultipleLocator(0.05))
    ax2.yaxis.set_major_locator(plt.MultipleLocator(0.05))
    
    # Add subtle box in the full range plot to show zoom area
    rect = plt.Rectangle((-0.01, 0.8), 1.01, 0.21, 
                        fill=False, 
                        linestyle='--', 
                        color='gray', 
                        alpha=0.5)
    ax1.add_patch(rect)
    
    plt.tight_layout()
    plt.show()

# Plot the ROC curve
plot_roc_curve(models, X_val, y_val)

All of our models perform excellently, with the ensemble methods performing best, as suggested by the AUC.

TESTING¶

In [10]:
df_test = pd.read_csv('dataset/test_labeled.csv')

# Remove columns from df_test that are not present in df_train
df_test = df_test[df.columns.intersection(df_test.columns)]

# Preprocess the data
X_test, y_test = df_test.drop(['is_benign', 'category', 'attack'], axis=1), df_test['is_benign']
In [11]:
def plot_model_metrics(models, X, y, name="VALIDATION"):
    metrics = ['accuracy', 'precision', 'recall', 'f1']
    x = np.arange(len(metrics))
    width = 0.15
    
    fig, ax = plt.subplots(figsize=(15, 6))
    colors = sns.color_palette("Blues", n_colors=len(models)+2)[2:]
    
    for i, ((model_name, model), color) in enumerate(zip(models.items(), colors)):
        y_pred = model.predict(X)
        metrics_values = [
            accuracy_score(y, y_pred),
            precision_score(y, y_pred),
            recall_score(y, y_pred),
            f1_score(y, y_pred)
        ]
        rects = ax.bar(x + i * width, metrics_values, width, label=model_name, color=color)
        
        for rect in rects:
            height = rect.get_height()
            ax.text(rect.get_x() + rect.get_width()/2., height/2.,
                    f'{height:.2f}',
                    ha='center', va='center',
                    color='white' if height > 0.3 else 'black',
                    fontsize=10,
                    weight='bold')
    
    ax.set_xlabel('Metrics', weight='bold')
    ax.set_title(f'Comparison of Classification Metrics ({name})', weight='bold')
    ax.set_xticks(x + width * ((len(models)-1)/2))
    ax.set_xticklabels(metrics)
    ax.set_ylim([0, 1])
    ax.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
    
    plt.tight_layout()
    plt.show()

VALIDATION SET

In [12]:
plot_model_metrics(models, X_val, y_val)

TEST SET

In [13]:
plot_model_metrics(models, X_test, y_test, "TEST")

We can see that the Random Forest, like XGBoost, not only performs better but is also more robust than the Logistic Regression.

4.2 MULTI-CLASS CLASSIFICATION¶

For this section, we used the same models as for the binary classification with the addition of Neural Networks.

In [114]:
df = pd.read_csv('dataset/train_sel_hclust.csv')

Data preprocessing¶

In [121]:
# Drop unnecessary columns
X = df.drop(['is_benign', 'attack', 'category'], axis=1)
y = df['category']
X_train, X_val, y_train, y_val = train_test_split(X, y, train_size=0.7, random_state=37)

Defining the Evaluation Method¶

As with the binary classification, we defined a function to calculate the chosen metrics for each model and plot the confusion matrix. This time, we had to choose an averaging method (through the parameter 'average'), since we now have more than two classes. We chose 'macro' averaging so that the metric calculated for each class counts equally. In contrast, the 'weighted' averaging method would have weighted the metrics by the number of entries per class, which would have minimized the impact of a low score in a class with fewer samples, something we wanted to avoid for our purpose.
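The difference between the two averaging methods can be seen on a toy imbalanced example (illustrative only, not from our dataset):

```python
from sklearn.metrics import f1_score

# Toy imbalanced case: the minority class (1) is half misclassified.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

# 'macro' averages the per-class F1 scores equally, so the poor
# minority-class score pulls the average down; 'weighted' scales each
# class by its support, masking the minority-class failure.
macro = f1_score(y_true, y_pred, average='macro')
weighted = f1_score(y_true, y_pred, average='weighted')
print(f"macro: {macro:.3f}, weighted: {weighted:.3f}")  # macro < weighted
```

This is exactly why we prefer 'macro' here: a model that fails on a rare attack category should be penalized, not excused by class frequencies.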

In [135]:
def evaluate_model(y_true, y_pred, model_name="Model", class_names=None):
    """
    Print comprehensive model evaluation metrics with both rich text output and seaborn heatmap.
    
    Parameters:
    -----------
    y_true : array-like
        True labels
    y_pred : array-like
        Predicted labels
    model_name : str, optional
        Name of the model for display purposes
    class_names : list, optional
        List of class names for axis labels
    """
    
    # Calculate core metrics
    metrics = {
        "Accuracy": accuracy_score(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred, average='macro'),
        "Recall": recall_score(y_true, y_pred, average='macro'),
        "F1 Score": f1_score(y_true, y_pred, average='macro')
    }
    
    # Create metrics table
    table_data = [[metric, f"{value:.5f}"] for metric, value in metrics.items()]
    table = tabulate(table_data, headers=["Metric", "Score"], tablefmt="psql")
    
    print(table)
    
    # Calculate and plot confusion matrix as heatmap
    cm = confusion_matrix(y_true, y_pred)
    cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    
    # Create heatmap with enhanced annotations
    plt.figure(figsize=(18, 10))
    if class_names is None:
        class_names = [f"Class {i}" for i in range(cm.shape[0])]
    
    sns.heatmap(
        cm_normalized,
        annot=True,
        fmt='.2f',
        cmap='Blues',
        xticklabels=class_names,
        yticklabels=class_names,
        annot_kws={'size': 16, 'weight': 'bold'}  # Increased font size and made bold
    )
    
    # Adjust title and label properties
    plt.title(f'{model_name} - Normalized Confusion Matrix', fontsize=16, pad=20)
    plt.xlabel('Predicted', fontsize=14, weight='bold')
    plt.ylabel('True', fontsize=14, weight='bold')
    
    # Adjust tick labels size
    plt.xticks(fontsize=12)
    plt.yticks(fontsize=12)
    
    plt.tight_layout()
    plt.show()
    
    return metrics

LOGISTIC REGRESSION¶

As we move to a multi-class classification problem, and considering the nature of our data, linear models may struggle to capture the complexity of the structures and relationships within it. The following code implements a Logistic Regression model to verify this assumption.

A high max_iter is necessary for this multiclass problem in order for the logistic regression to converge to a good optimum.

In [97]:
# Initial model
log_reg = LogisticRegression(C=1.5, max_iter=20000, random_state = 37)

# Scale the data before fitting the model
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
In [122]:
start_time = time.time()

# Fit the grid search
log_reg.fit(X_train_scaled, y_train);

elapsed_time = time.time() - start_time
print(f"Elapsed time {elapsed_time:.2f} seconds")
Elapsed time 63.54 seconds
In [136]:
y_pred = log_reg.predict(X_val_scaled)
# Evaluate the model
_ = evaluate_model(y_val, y_pred, "Logistic Regression Multiclass", np.unique(y))
+-----------+---------+
| Metric    |   Score |
|-----------+---------|
| Accuracy  | 0.75682 |
| Precision | 0.79062 |
| Recall    | 0.69217 |
| F1 Score  | 0.7058  |
+-----------+---------+
In [124]:
dump(log_reg, 'models/log_reg_multiclass.joblib');

RANDOM FOREST¶

Optimising Hyperparameters¶

This time, we used Bayesian Optimisation, with 3-fold cross-validation, to choose the best parameters for our model.

This method enables a more intelligent search: instead of evaluating all combinations in a parameter grid, it builds a probabilistic model of the performance function and explores the parameter space accordingly. Thus:

  • It's possible to assign parameter ranges (for numeric parameters)
  • This search method is more efficient: it needs fewer iterations
In [125]:
def average_score(y_true, y_pred):
    return (accuracy_score(y_true, y_pred) + 
            precision_score(y_true, y_pred, average='macro') + 
            recall_score(y_true, y_pred, average='macro') + 
            f1_score(y_true, y_pred, average='macro')) / 4
In [126]:
average_score = make_scorer(average_score, needs_proba=False)

# Initial model
rnd_forest = RandomForestClassifier()

# Define the parameter space for Bayesian optimization
param_space = {
    'criterion': ['gini', 'entropy', 'log_loss'],
    'max_depth': (5, 50),
    'n_estimators': (20,100),
    'random_state': [37]
}

# Set up StratifiedKFold with random_state for reproducibility
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=37)

# Set up BayesSearchCV
search = BayesSearchCV(
    estimator=rnd_forest,
    search_spaces=param_space,
    n_iter=25,  # Number of parameter combinations to evaluate
    cv=cv,
    scoring=average_score,  # Our custom scoring metric
    verbose=0,
    random_state=37,
    n_jobs=-1  # Use all available cores
)

TRAINING

In [127]:
start_time = time.time()

# Fit the grid search
search.fit(X_train, y_train)

elapsed_time = time.time() - start_time
print(f"Elapsed time {elapsed_time:.2f} seconds")

# Access the best model and parameters
best_model = search.best_estimator_
best_params = search.best_params_

# Print the best parameters in a table
param_table = [[param, value] for param, value in best_params.items()]
print("Best Parameters by Average Score:")
print(tabulate(param_table, headers=["Parameter", "Value"], tablefmt="psql"))
Elapsed time 2048.43 seconds
Best Parameters by Average Score:
+--------------+---------+
| Parameter    | Value   |
|--------------+---------|
| criterion    | gini    |
| max_depth    | 50      |
| n_estimators | 100     |
| random_state | 37      |
+--------------+---------+

VALIDATION

In [137]:
y_pred = best_model.predict(X_val)
# Evaluate the model
_ = evaluate_model(y_val, y_pred, "Random Forest Multiclass", np.unique(y))
+-----------+---------+
| Metric    |   Score |
|-----------+---------|
| Accuracy  | 0.9957  |
| Precision | 0.98658 |
| Recall    | 0.97093 |
| F1 Score  | 0.97843 |
+-----------+---------+
In [131]:
dump(best_model, 'models/rnd_forest_multiclass.joblib');

XGBOOST¶

DATA PREPARATION

In [132]:
# Need to encode the target labels
le = LabelEncoder()
y_train_encoded = le.fit_transform(y_train)
y_val_encoded = le.transform(y_val)

SETTING UP THE MODEL

In [139]:
# Initial model
model = xgb.XGBClassifier(
    objective='multi:softprob',
    num_class=len(np.unique(le.classes_)),
    random_state=37,
    n_estimators=100,
    learning_rate=0.1
)

# Parameter space
param_space = {
    'max_depth': (5, 50),
    'min_child_weight': (1, 6),
    'gamma': (0, 0.5),
    'subsample': (0.6, 1.0),
    'colsample_bytree': (0.6, 1.0)
}

# Set up StratifiedKFold with random_state for reproducibility
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=37)

# Set up BayesSearchCV
search = BayesSearchCV(
    estimator = model,
    search_spaces = param_space,
    n_iter=37,
    cv=cv,
    scoring='accuracy', # Custom scorer doesn't work with label encoded targets
    n_jobs=-1,
    verbose=0,
    random_state=37
)

TRAINING

In [140]:
start_time = time.time()

# Fit the grid search
search.fit(X_train, y_train_encoded)

elapsed_time = time.time() - start_time
print(f"Elapsed time {elapsed_time:.2f} seconds")

# Access the best model and parameters
best_model = search.best_estimator_
best_params = search.best_params_

# Print the best parameters in a table
param_table = [[param, value] for param, value in best_params.items()]
print("Best Parameters by Average Score:")
print(tabulate(param_table, headers=["Parameter", "Value"], tablefmt="psql"))
Elapsed time 2232.46 seconds
Best Parameters by Average Score:
+------------------+-----------+
| Parameter        |     Value |
|------------------+-----------|
| colsample_bytree |  0.766697 |
| gamma            |  0.344138 |
| max_depth        | 16        |
| min_child_weight |  2        |
| subsample        |  0.883303 |
+------------------+-----------+

VALIDATION

In [145]:
y_pred = best_model.predict(X_val)
# Evaluate the model
_ = evaluate_model(y_val_encoded, y_pred, "XGBoost Multiclass", le.classes_)
+-----------+---------+
| Metric    |   Score |
|-----------+---------|
| Accuracy  | 0.99541 |
| Precision | 0.9833  |
| Recall    | 0.97083 |
| F1 Score  | 0.97686 |
+-----------+---------+
In [143]:
dump(best_model, 'models/xgboost_multiclass.joblib');

Neural Network¶

SMOTE

To address class imbalance and improve model performance:

  1. Undersampling: reduced large classes to a set upper limit.
  2. Oversampling with SMOTE: generated synthetic samples for small classes to reach a minimum count.

We apply this specifically for neural networks because they are sensitive to class imbalance, which can lead to biased predictions and poor generalization for minority classes. Balancing ensures the model trains effectively across all categories.

In [ ]:
df = pd.read_csv('dataset/train_labeled.csv')

# Get category counts
category_counts = df['category'].value_counts()
# Print category counts table
print(tabulate(
    category_counts.reset_index().values,
    headers=['Category', 'Count'],
    tablefmt='psql'
))
+------------+---------+
| Category   |   Count |
|------------+---------|
| DDoS       |  581986 |
| DoS        |  284552 |
| BENIGN     |  192732 |
| MQTT       |  107607 |
| RECON      |   52541 |
| SPOOFING   |   16047 |
+------------+---------+
In [ ]:
# Calculate target counts
min_ratio = 0.5
max_ratio = 1.1

median_count = category_counts.median()
min_samples = int(median_count * min_ratio)  # Lower bound
max_samples = int(median_count * max_ratio)  # Upper bound

# Initialize final dataframes list
final_dfs = []

# Handle undersampling for large classes
for category in category_counts[category_counts > max_samples].index:
    category_df = df[df['category'] == category]
    downsampled = resample(category_df,
                            n_samples=max_samples,
                            random_state=37)
    final_dfs.append(downsampled)

# Handle oversampling for small classes
small_categories = category_counts[category_counts < min_samples].index
if len(small_categories) > 0:
    # Prepare data for SMOTE
    small_df = df[df['category'].isin(small_categories)]
    cat_cols = ['category', 'attack']
    cat_data = small_df[cat_cols].copy()
    
    # Apply SMOTE
    smote = SMOTE(sampling_strategy={
            cat: min(category_counts[cat] * 2, min_samples) for cat in small_categories
        }, random_state=37)
    
    X_resampled, y_resampled = smote.fit_resample(
        small_df.drop(cat_cols, axis=1), 
        small_df['category']
    )
    
    # Reconstruct DataFrame
    augmented_df = pd.DataFrame(X_resampled, columns=df.drop(cat_cols, axis=1).columns)
    augmented_df['category'] = y_resampled
    augmented_df['attack'] = cat_data['attack'].iloc[0]  # Simplified attack labeling
    final_dfs.append(augmented_df)

# Keep medium-sized classes as is
medium_mask = (category_counts >= min_samples) & (category_counts <= max_samples)
for category in category_counts[medium_mask].index:
    final_dfs.append(df[df['category'] == category])

# Combine all data
balanced_df = pd.concat(final_dfs, ignore_index=True)
balanced_df.to_csv('dataset/train_smote.csv', index=False)

# Get final counts and prepare comparison
final_counts = balanced_df['category'].value_counts()

# Create comparison table
comparison_data = []
for category in sorted(category_counts.index):
    comparison_data.append([
        category,
        category_counts[category],
        final_counts.get(category, 0)
    ])

# Print comparison table
print(tabulate(
    comparison_data,
    headers=['Category', 'Original', 'After Balance'],
    tablefmt='psql'
))
+------------+------------+-----------------+
| Category   |   Original |   After Balance |
|------------+------------+-----------------|
| BENIGN     |     192732 |          165186 |
| DDoS       |     581986 |          165186 |
| DoS        |     284552 |          165186 |
| MQTT       |     107607 |          107607 |
| RECON      |      52541 |           75084 |
| SPOOFING   |      16047 |           32094 |
+------------+------------+-----------------+

In this section we prepared and preprocessed the input data for the development of a neural network model whose purpose is to classify data distinguishing between benign traffic and each attack.

First of all, label encoding converts the categorical labels into integers, making the data compatible with the neural network's classification task. To stabilize the learning process of the NN and facilitate convergence, it was also necessary to standardize the input features, restricting their range of values.

Then, the use of PyTorch's TensorDataset and DataLoader improves performance by batching, shuffling and loading the data during training.

In [ ]:
def prepare_data_loaders(df, batch_size=64, test_size=0.3, random_state=37):
    """
    Prepare data loaders for training and validation
    """
    # Initialize preprocessing objects
    label_encoder = LabelEncoder()
    scaler = StandardScaler()

    # Prepare features and labels
    x_train = df.drop(['is_benign', 'category', 'attack'], axis=1).values
    y_train = df['category']

    # Encode labels as integers
    y_train = label_encoder.fit_transform(y_train)

    # Split data
    x_train_data, x_val_data, y_train_data, y_val_data = train_test_split(
        x_train, y_train, test_size=test_size, random_state=random_state
    )

    # Scale features
    x_train_scaled = scaler.fit_transform(x_train_data)
    x_val_scaled = scaler.transform(x_val_data)

    # Convert to PyTorch tensors
    x_train_tensor = torch.tensor(x_train_scaled, dtype=torch.float32)
    x_val_tensor = torch.tensor(x_val_scaled, dtype=torch.float32)
    y_train_tensor = torch.tensor(y_train_data, dtype=torch.long)
    y_val_tensor = torch.tensor(y_val_data, dtype=torch.long)

    # Create datasets and loaders
    train_dataset = TensorDataset(x_train_tensor, y_train_tensor)
    val_dataset = TensorDataset(x_val_tensor, y_val_tensor)

    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)

    return {
        'train_loader': train_loader,
        'val_loader': val_loader,
        'val_dataset': val_dataset,
        'input_size': x_train.shape[1],
        'num_classes': len(label_encoder.classes_),
        'class_counts': torch.bincount(y_train_tensor),
        'y_train_tensor': y_train_tensor,
        'classes': label_encoder.classes_
    }

This neural network is designed to handle imbalanced data effectively. The feature extractor captures essential patterns using normalization, dropout, and non-linear activations. Residual connections in the architecture mitigate vanishing gradients, allowing deeper learning even with imbalanced datasets. He (Kaiming) initialization ensures stable training, making the model resilient and better able to generalize across all classes. This care is essential for neural networks, which are particularly sensitive to class imbalance and require careful design to prevent bias.

In [ ]:
class FeatureExtractor(nn.Module):
    def __init__(self, input_size):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(input_size, 512),
            nn.BatchNorm1d(512),
            nn.LeakyReLU(0.1),
            nn.Dropout(0.1)
        )

    def forward(self, x):
        return self.layers(x)

class ResidualBlock(nn.Module):
    def __init__(self, size):
        super().__init__()
        self.block = nn.Sequential(
            nn.BatchNorm1d(size),
            nn.Linear(size, 2*size),
            nn.LeakyReLU(0.1),
            nn.Linear(2*size, size),
            nn.Dropout(0.1),
        )
        self.activation = nn.LeakyReLU(0.1)

    def forward(self, x):
        identity = x
        out = self.block(x)
        out += identity
        return self.activation(out)

class NeuralNetwork(nn.Module):
    def __init__(self, input_size, num_classes):
        super().__init__()

        # Feature extraction path
        self.feature_extractor = FeatureExtractor(input_size)

        # Main processing path with residual connections
        self.main_path = nn.Sequential(
          ResidualBlock(512),
          ResidualBlock(512),
          ResidualBlock(512),
          nn.Linear(512, 256),
          nn.BatchNorm1d(256),
          nn.LeakyReLU(0.2),
          nn.Dropout(0.1)
        )

        self.classifier = nn.Sequential(
            nn.Linear(256, num_classes)
        )

        self._initialize_weights()

    def _initialize_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Linear):
                nn.init.kaiming_normal_(m.weight, mode='fan_in', nonlinearity='leaky_relu')
                if m.bias is not None:
                    nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.BatchNorm1d):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)

    def forward(self, x):
        # Extract features
        features = self.feature_extractor(x)

        # Process through main path
        main_features = self.main_path(features)

        # Classification
        output = self.classifier(main_features)

        return output

def get_optimizer(model, learning_rate=0.001, weight_decay=1e-5):
    """
    Create optimizer for the model
    """
    return torch.optim.AdamW(
        model.parameters(),
        lr=learning_rate,
        weight_decay=weight_decay,
        betas=(0.9, 0.999)
    )

def get_scheduler(optimizer):
    """
    Create learning rate scheduler
    """
    return torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer,
        mode='min',
        factor=0.1,
        patience=5
    )

# Example of model initialization (to be used in the training loop):
def initialize_model(input_size, num_classes, device):
    """
    Initialize the model, optimizer, and scheduler
    """
    model = NeuralNetwork(input_size, num_classes).to(device)
    optimizer = get_optimizer(model)
    scheduler = get_scheduler(optimizer)

    return model, optimizer, scheduler

A dictionary named "history" is created to record the main performance metrics at each epoch during training. Each metric's evolution is then plotted on a line chart with specifically defined colors (blue for Training Loss, orange for Validation Loss and green for Accuracy).

In [ ]:
class TrainingMetrics:
    def __init__(self):
        self.history = {
            'train_loss': [],
            'val_loss': [],
            'accuracy': [],
            'learning_rates': []
        }

    def update(self, train_loss, val_loss, accuracy, lr):
        self.history['train_loss'].append(train_loss)
        self.history['val_loss'].append(val_loss)
        self.history['accuracy'].append(accuracy)
        self.history['learning_rates'].append(lr)

    def plot(self, dataset_name):
        fig, ax1 = plt.subplots(figsize=(20, 10))
        
        # Plot metrics on primary y-axis
        metrics = ['train_loss', 'val_loss', 'accuracy']
        colors = ['blue', 'orange', 'green']
        for metric, color in zip(metrics, colors):
            ax1.plot(self.history[metric], 
                    label=metric.replace('_', ' ').title(),
                    color=color)
        
        # Create secondary y-axis for learning rate
        ax2 = ax1.twinx()
        ax2.set_yscale('log')  # Set logarithmic scale for learning rate
        ax2.plot(self.history['learning_rates'], 
                label='Learning Rate',
                color='red',
                linestyle='--')
        
        # Customize plot
        ax1.set_xlabel('Epoch')
        ax1.set_ylabel('Metrics Value')
        ax2.set_ylabel('Learning Rate (log scale)')
        
        # Combine legends from both axes
        lines1, labels1 = ax1.get_legend_handles_labels()
        lines2, labels2 = ax2.get_legend_handles_labels()
        ax1.legend(lines1 + lines2, labels1 + labels2, loc='center right')
        
        plt.title(f'Training Metrics - {dataset_name}')
        plt.grid(True)
        plt.tight_layout()
        plt.show()

The following code describes the implementation of the NN model, designed to classify malicious attack types and benign network flows. The strength point of this phase is the fact that the model is trained sequentially on three distinct datasets, whose performances are then compared, respectively:

  • train_labeled
  • train_smote, in which the data have been rebalanced in order to oversample the benign class.
  • train_sel_hclust

During each training epoch, the model processes batches of data and computes the cross-entropy loss. The optimizer updates the model's weights via backpropagation, and the scheduler dynamically adjusts the learning rate based on the validation loss. The model is then evaluated on the validation set and the metrics are computed. If the validation accuracy exceeds the best accuracy seen so far, the model's state is saved as the best model; otherwise, if accuracy does not improve for a certain number of epochs, training stops (early stopping). The patience is arbitrarily set to 15, but reduced to 9 under certain conditions, in order to improve training efficiency and avoid overfitting.

After training, the evaluate_nn function calculates performance metrics (precision, recall and F1-score) by comparing the predicted labels with the true ones. Results are shown in a bar chart reporting the metrics for each individual class. Furthermore, a normalized confusion matrix is plotted to easily visualize the types of errors the model makes.

A summary table consolidates the results for easy comparison of best accuracy and validation loss across the three datasets.
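The helpers get_optimizer and get_scheduler used below are defined earlier in the notebook. A minimal sketch consistent with how they are called here (Adam with weight decay, and a plateau-based scheduler, since scheduler.step() receives the validation loss) might look like this; the exact hyperparameters are assumptions:

```python
import torch.nn as nn
import torch.optim as optim

# Hypothetical minimal versions of the notebook's helpers; the real
# definitions appear in an earlier section.
def get_optimizer(model, learning_rate, weight_decay):
    return optim.Adam(model.parameters(), lr=learning_rate, weight_decay=weight_decay)

def get_scheduler(optimizer):
    # Reduce the learning rate when the validation loss stops improving
    return optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.5, patience=3)

# Quick check on a toy model
model = nn.Linear(4, 2)
opt = get_optimizer(model, learning_rate=0.001, weight_decay=1e-5)
sched = get_scheduler(opt)
sched.step(0.5)  # pass the validation loss, as train_model does below
print(opt.param_groups[0]['lr'])
```

With patience=3 the learning rate is unchanged after a single step; it only drops once the validation loss plateaus for several epochs.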

In [ ]:
def evaluate_nn(model, val_loader, classes, device):
    """Enhanced model evaluation with custom visualizations"""
    y_pred = []
    y_true = []
    probs = []
    
    model.eval()
    with torch.no_grad():
        for inputs, labels in val_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            outputs = model(inputs)
            _, predicted = torch.max(outputs, 1)
            probs.append(torch.softmax(outputs, dim=1).cpu().numpy())
            y_pred.extend(predicted.cpu().numpy())
            y_true.extend(labels.cpu().numpy())
            
    y_pred = np.array(y_pred)
    y_true = np.array(y_true)
    probs = np.concatenate(probs)
    
    # Plot metrics
    _, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 10))
    
    # Custom Classification Report Visualization
    metrics = {
        'Precision': precision_score(y_true, y_pred, average=None),
        'Recall': recall_score(y_true, y_pred, average=None),
        'F1-Score': f1_score(y_true, y_pred, average=None)
    }
    
    metrics_df = pd.DataFrame(metrics, index=classes)
    metrics_df.plot(kind='bar', ax=ax1, color=sns.color_palette("Blues", n_colors=len(metrics)))
    ax1.set_title('Classification Metrics by Class')
    ax1.set_xticklabels(classes, rotation=45)
    ax1.grid(True, alpha=0.3)
    
    # Enhanced Confusion Matrix
    sns.heatmap(confusion_matrix(y_true, y_pred, normalize='true'),
                annot=True, fmt='.2f', cmap='Blues',
                xticklabels=classes, yticklabels=classes, ax=ax2)
    ax2.set_title('Normalized Confusion Matrix')
    ax2.set_xlabel('Predicted')
    ax2.set_ylabel('True')
    
    plt.tight_layout()
    plt.show()
    
    # Return metrics for logging
    return {
        'accuracy': accuracy_score(y_true, y_pred),
        'macro_f1': f1_score(y_true, y_pred, average='macro'),
        'metrics_by_class': metrics_df
    }

def train_model(model, train_loader, val_loader, val_dataset, criterion, optimizer, 
                scheduler, epochs, device, dataset_name, save_dir, goat):
    
    print(f"Training on dataset: {dataset_name}")
    
    metrics = TrainingMetrics()
    best_val_loss = float('inf')
    best_accuracy = 0.0
    patience = 9 if goat != 0 else 15
    counter = 0

    for epoch in range(epochs):
        # Training phase
        model.train()
        epoch_loss = 0
        num_batches = len(train_loader)
        
        for batch_X, batch_y in tqdm(train_loader, desc=f"{dataset_name} - Epoch {epoch + 1}/{epochs}", bar_format='{desc}: {elapsed}'):
            batch_X, batch_y = batch_X.to(device), batch_y.to(device)
            optimizer.zero_grad()
            outputs = model(batch_X)
            loss = criterion(outputs, batch_y)
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()

        # Validation phase
        model.eval()
        val_loss = 0
        correct = 0
        with torch.no_grad():
            for batch_X, batch_y in val_loader:
                batch_X, batch_y = batch_X.to(device), batch_y.to(device)
                outputs = model(batch_X)
                val_loss += criterion(outputs, batch_y).item()
                _, predicted = torch.max(outputs, 1)
                correct += (predicted == batch_y).sum().item()

        # Compute metrics
        avg_train_loss = epoch_loss / num_batches
        avg_val_loss = val_loss / len(val_loader)
        accuracy = correct / len(val_dataset)
        current_lr = optimizer.param_groups[0]['lr']
        
        metrics.update(avg_train_loss, avg_val_loss, accuracy, current_lr)
        
        # Update best model
        if accuracy > best_accuracy:
            best_accuracy = accuracy
            best_val_loss = avg_val_loss
            torch.save({
                'epoch': epoch,
                'model_state_dict': model.state_dict(),
                'optimizer_state_dict': optimizer.state_dict(),
                'scheduler_state_dict': scheduler.state_dict(),
                'best_accuracy': best_accuracy,
                'best_val_loss': best_val_loss
            }, Path(save_dir) / f"best_model_{dataset_name}.pth")
            
        if best_accuracy > goat:
            counter = 0
            goat = best_accuracy
            patience = 15
        else:
            counter += 1
        
        scheduler.step(avg_val_loss)
        
        if counter >= patience:
            print(f"Early stopping after {patience} epochs without improvement")
            break
            
    metrics.plot(dataset_name)
    return best_accuracy, best_val_loss, goat

def train_on_multiple_datasets(dataset_paths, save_dir='models', merged=False):
    """
    Train the model sequentially on multiple datasets
    """
    # Setup device
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(f"Using device: {device}")

    # Create save directory
    save_dir = Path(save_dir)

    # Training configuration
    config = {
        'batch_size': 64,
        'epochs': 30,
        'learning_rate': 0.001,
        'weight_decay': 1e-5,
    }

    results = {}
    goat = 0.0

    # Train on each dataset sequentially
    for dataset_path in dataset_paths:
        dataset_name = Path(dataset_path).stem

        # Load and prepare data
        df = pd.read_csv(dataset_path, low_memory=False)
        
        if merged:
            df['category'] = df['category'].replace({
                                        'DDoS': 'DOS_DDOS',
                                        'DoS': 'DOS_DDOS'
                                    })
            
            dataset_name += '_merged'
        
        data = prepare_data_loaders(df, batch_size=config['batch_size'])
        class_counts = data['class_counts'] 

        # Initialize model, optimizer, and criterion
        model = NeuralNetwork(data['input_size'], data['num_classes']).to(device)
        optimizer = get_optimizer(model, config['learning_rate'], config['weight_decay'])
        scheduler = get_scheduler(optimizer)

        # Calculate class weights for balanced training
        total_samples = len(data['y_train_tensor'])
        class_weights = total_samples / (len(data['class_counts']) * data['class_counts'])
        criterion = nn.CrossEntropyLoss(weight=class_weights.to(device))

        best_accuracy, best_val_loss, goat = train_model(
            model=model,
            train_loader=data['train_loader'],
            val_loader=data['val_loader'],
            val_dataset=data['val_dataset'],
            criterion=criterion,
            optimizer=optimizer,
            scheduler=scheduler,
            epochs=config['epochs'],
            device=device,
            dataset_name=dataset_name,
            save_dir=save_dir,
            goat=goat
        )

        # Check if the best model file exists
        best_model_path = Path(save_dir) / f"best_model_{dataset_name}.pth"
        if not best_model_path.exists():
            print(f"Best model file for {dataset_name} not found. Skipping evaluation.")
            continue

        # Load best model for evaluation
        checkpoint = torch.load(best_model_path, weights_only=True)
        model.load_state_dict(checkpoint['model_state_dict'])

        # Evaluate model
        # print(f"\nEvaluating model for dataset: {dataset_name}")
        evaluate_nn(model, data['val_loader'], data['classes'], device)

        # Store results
        results[dataset_name] = {
            'best_accuracy': best_accuracy,
            'best_val_loss': best_val_loss
        }

    # Print final results summary in a table using tabulate
    results_df = pd.DataFrame(results).T
    results_df.columns = ['Best Accuracy', 'Best Validation Loss']
    print("\nTraining Results Summary:")
    print(tabulate(results_df, headers='keys', tablefmt='psql'))
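The class-weighting formula above (total samples divided by the number of classes times each class count) is scikit-learn's "balanced" heuristic: rare classes receive proportionally larger loss weights. A quick numeric check with toy counts:

```python
import numpy as np
import pandas as pd

# Toy class counts standing in for data['class_counts']
class_counts = pd.Series({'BENIGN': 800, 'DDoS': 150, 'SPOOFING': 50})
total_samples = class_counts.sum()  # 1000

# Same formula as in train_on_multiple_datasets
class_weights = total_samples / (len(class_counts) * class_counts)
print(class_weights.round(3))

# Sanity check: weighted counts are equalized across classes
assert np.allclose(class_counts * class_weights, total_samples / len(class_counts))
```

Each class then contributes equally to the weighted cross-entropy loss regardless of its frequency.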
In [ ]:
dataset_paths = [
    "dataset/train_labeled.csv",
    "dataset/train_smote.csv",
    "dataset/train_sel_hclust.csv",
]

train_on_multiple_datasets(dataset_paths)
Using device: cuda
Training on dataset: train_labeled
train_labeled - Epoch 1/30: 00:57
train_labeled - Epoch 2/30: 00:57
train_labeled - Epoch 3/30: 00:57
train_labeled - Epoch 4/30: 00:56
train_labeled - Epoch 5/30: 00:58
train_labeled - Epoch 6/30: 00:58
train_labeled - Epoch 7/30: 00:57
train_labeled - Epoch 8/30: 00:57
train_labeled - Epoch 9/30: 00:58
train_labeled - Epoch 10/30: 00:57
train_labeled - Epoch 11/30: 00:57
train_labeled - Epoch 12/30: 00:57
train_labeled - Epoch 13/30: 00:57
train_labeled - Epoch 14/30: 00:58
train_labeled - Epoch 15/30: 00:57
train_labeled - Epoch 16/30: 00:57
train_labeled - Epoch 17/30: 00:57
train_labeled - Epoch 18/30: 00:57
train_labeled - Epoch 19/30: 00:57
train_labeled - Epoch 20/30: 00:57
train_labeled - Epoch 21/30: 00:57
train_labeled - Epoch 22/30: 00:57
train_labeled - Epoch 23/30: 00:57
train_labeled - Epoch 24/30: 00:57
train_labeled - Epoch 25/30: 00:57
train_labeled - Epoch 26/30: 00:57
train_labeled - Epoch 27/30: 00:57
train_labeled - Epoch 28/30: 00:57
Early stopping after 15 epochs without improvement
Training on dataset: train_smote
train_smote - Epoch 1/30: 00:33
train_smote - Epoch 2/30: 00:33
train_smote - Epoch 3/30: 00:33
train_smote - Epoch 4/30: 00:33
train_smote - Epoch 5/30: 00:33
train_smote - Epoch 6/30: 00:33
train_smote - Epoch 7/30: 00:33
train_smote - Epoch 8/30: 00:33
train_smote - Epoch 9/30: 00:33
train_smote - Epoch 10/30: 00:33
train_smote - Epoch 11/30: 00:33
train_smote - Epoch 12/30: 00:33
train_smote - Epoch 13/30: 00:33
train_smote - Epoch 14/30: 00:33
train_smote - Epoch 15/30: 00:33
train_smote - Epoch 16/30: 00:33
train_smote - Epoch 17/30: 00:33
train_smote - Epoch 18/30: 00:33
train_smote - Epoch 19/30: 00:33
train_smote - Epoch 20/30: 00:33
train_smote - Epoch 21/30: 00:33
train_smote - Epoch 22/30: 00:33
train_smote - Epoch 23/30: 00:33
train_smote - Epoch 24/30: 00:33
train_smote - Epoch 25/30: 00:33
train_smote - Epoch 26/30: 00:33
train_smote - Epoch 27/30: 00:33
train_smote - Epoch 28/30: 00:33
Early stopping after 15 epochs without improvement
Training on dataset: train_sel_hclust
train_sel_hclust - Epoch 1/30: 00:58
train_sel_hclust - Epoch 2/30: 00:57
train_sel_hclust - Epoch 3/30: 00:57
train_sel_hclust - Epoch 4/30: 00:58
train_sel_hclust - Epoch 5/30: 00:57
train_sel_hclust - Epoch 6/30: 00:58
train_sel_hclust - Epoch 7/30: 00:57
train_sel_hclust - Epoch 8/30: 00:57
train_sel_hclust - Epoch 9/30: 00:57
Early stopping after 9 epochs without improvement
Training Results Summary:
+------------------+-----------------+------------------------+
|                  |   Best Accuracy |   Best Validation Loss |
|------------------+-----------------+------------------------|
| train_labeled    |        0.772842 |               0.35129  |
| train_smote      |        0.803503 |               0.324712 |
| train_sel_hclust |        0.70862  |               0.382404 |
+------------------+-----------------+------------------------+

Given these results, it is clear how difficult it is for the model to distinguish between DoS and DDoS attacks. To address this issue, we tried merging the two categories (also because, at their root, these attacks are similar).
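The merge performed by train_on_multiple_datasets with merged=True is a simple pandas replace; a minimal illustration with toy labels:

```python
import pandas as pd

# Toy categories standing in for the dataset's 'category' column
df = pd.DataFrame({'category': ['DDoS', 'DoS', 'BENIGN', 'MQTT', 'DDoS']})

# Collapse DoS and DDoS into a single DOS_DDOS class, as in the code above
df['category'] = df['category'].replace({'DDoS': 'DOS_DDOS', 'DoS': 'DOS_DDOS'})
print(df['category'].value_counts())
```

Applied to the full dataset, this turns the six original classes into five, removing the DoS/DDoS boundary the model struggles with.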

In [ ]:
train_on_multiple_datasets(dataset_paths, merged = True)
Using device: cuda
Training on dataset: train_labeled
train_labeled - Epoch 1/30: 00:58
train_labeled - Epoch 2/30: 00:57
train_labeled - Epoch 3/30: 00:57
train_labeled - Epoch 4/30: 00:57
train_labeled - Epoch 5/30: 00:57
train_labeled - Epoch 6/30: 00:57
train_labeled - Epoch 7/30: 00:58
train_labeled - Epoch 8/30: 00:58
train_labeled - Epoch 9/30: 00:57
train_labeled - Epoch 10/30: 00:57
train_labeled - Epoch 11/30: 00:59
train_labeled - Epoch 12/30: 00:58
train_labeled - Epoch 13/30: 00:57
train_labeled - Epoch 14/30: 00:57
train_labeled - Epoch 15/30: 00:57
train_labeled - Epoch 16/30: 00:57
train_labeled - Epoch 17/30: 00:57
train_labeled - Epoch 18/30: 00:57
train_labeled - Epoch 19/30: 00:57
train_labeled - Epoch 20/30: 00:57
train_labeled - Epoch 21/30: 00:57
train_labeled - Epoch 22/30: 00:57
train_labeled - Epoch 23/30: 00:57
train_labeled - Epoch 24/30: 00:57
train_labeled - Epoch 25/30: 00:57
train_labeled - Epoch 26/30: 00:57
train_labeled - Epoch 27/30: 00:57
train_labeled - Epoch 28/30: 00:57
train_labeled - Epoch 29/30: 00:57
train_labeled - Epoch 30/30: 00:57
Training on dataset: train_smote
train_smote - Epoch 1/30: 00:33
train_smote - Epoch 2/30: 00:33
train_smote - Epoch 3/30: 00:33
train_smote - Epoch 4/30: 00:33
train_smote - Epoch 5/30: 00:33
train_smote - Epoch 6/30: 00:33
train_smote - Epoch 7/30: 00:33
train_smote - Epoch 8/30: 00:33
train_smote - Epoch 9/30: 00:33
Early stopping after 9 epochs without improvement
Training on dataset: train_sel_hclust
train_sel_hclust - Epoch 1/30: 00:58
train_sel_hclust - Epoch 2/30: 00:58
train_sel_hclust - Epoch 3/30: 00:57
train_sel_hclust - Epoch 4/30: 00:57
train_sel_hclust - Epoch 5/30: 00:58
train_sel_hclust - Epoch 6/30: 00:57
train_sel_hclust - Epoch 7/30: 00:57
train_sel_hclust - Epoch 8/30: 00:57
train_sel_hclust - Epoch 9/30: 00:57
Early stopping after 9 epochs without improvement
Training Results Summary:
+------------------+-----------------+------------------------+
|                  |   Best Accuracy |   Best Validation Loss |
|------------------+-----------------+------------------------|
| train_labeled    |        0.97643  |               0.150561 |
| train_smote      |        0.957335 |               0.158727 |
| train_sel_hclust |        0.972842 |               0.217276 |
+------------------+-----------------+------------------------+

TESTING¶

In [ ]:
def plot_model_metrics(results, name="VALIDATION"):
    metrics = ['accuracy', 'precision', 'recall', 'f1']
    model_names = ['Logistic Regression', 'Random Forest', 'XGBoost', 
                    'Neural Network (6 Classes)', 'Neural Network (5 Classes)']

    fig, ax = plt.subplots(figsize=(15, 6))
    width = 0.15
    x = np.arange(len(metrics))

    # Create a palette with 7 colors and use only the last 5
    colors = sns.color_palette("Blues", n_colors=len(model_names)+2)[2:]

    for i, (model, color) in enumerate(zip(model_names, colors)):
        values = [results[f"{model}_{metric}"][0] for metric in metrics]
        rects = ax.bar(x + i * width, values, width, label=model, color=color)
        
        for rect in rects:
            height = rect.get_height()
            ax.text(rect.get_x() + rect.get_width()/2., height/2.,
                    f'{height:.2f}',
                    ha='center', va='center', 
                    color='white' if height > 0.3 else 'black',
                    fontsize=10, 
                    weight='bold')

    ax.set_xlabel('Metrics', weight='bold')
    ax.set_title(f'Comparison of Classification Metrics ({name})', weight='bold')
    ax.set_xticks(x + width * 2)
    ax.set_xticklabels(metrics)
    ax.set_ylim([0, 1])
    ax.legend(bbox_to_anchor=(1.05, 1), loc='upper left')

    plt.tight_layout()
    plt.show()

LOADING MODELS

The NNWrapper class is used to wrap the neural networks, providing a unified predict method for compatibility with the training and evaluation pipeline. This is necessary because neural networks in PyTorch do not inherently have a predict function, unlike scikit-learn models. The wrapper converts input data into tensors and ensures predictions are made efficiently in evaluation mode.

In [ ]:
class NNWrapper:
   def __init__(self, model):
       self.model = model
       
   def predict(self, X):
       self.model.eval()
       with torch.no_grad():
           X_tensor = torch.tensor(X.values, dtype=torch.float32)
           logits = self.model(X_tensor)
           return torch.argmax(logits, dim=1).numpy()
In [ ]:
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)

log_reg = load('models/log_reg_multiclass.joblib')
rnd_frs = load('models/rnd_forest_multiclass.joblib')
xgboost = load('models/xgboost_multiclass.joblib')

# Rebuild each NN architecture from its checkpoint's weight shapes
loaded = torch.load('models/best_model_train_smote.pth', map_location=torch.device('cpu'))
input_size = loaded['model_state_dict']['feature_extractor.layers.0.weight'].shape[1]
num_classes = loaded['model_state_dict']['classifier.0.weight'].shape[0]
nn_6cla = NeuralNetwork(input_size, num_classes)
nn_6cla.load_state_dict(loaded['model_state_dict'])
nn_6cla.eval()

loaded = torch.load('models/best_model_train_labeled_merged.pth', map_location=torch.device('cpu'))
input_size = loaded['model_state_dict']['feature_extractor.layers.0.weight'].shape[1]
num_classes = loaded['model_state_dict']['classifier.0.weight'].shape[0]
nn_5cla = NeuralNetwork(input_size, num_classes)
nn_5cla.load_state_dict(loaded['model_state_dict'])
nn_5cla.eval()

models = {
    'Logistic Regression': log_reg,
    'Random Forest': rnd_frs,
    'XGBoost': xgboost,
    'Neural Network (6 Classes)': NNWrapper(nn_6cla),
    'Neural Network (5 Classes)': NNWrapper(nn_5cla)
}

VALIDATION SET

Each model is paired with the dataset requirements, such as scaling or encoding, that match its training conditions. The script evaluates all models on their validation sets, computing accuracy, precision, recall, and F1 score for a consistent performance comparison.

In [ ]:
def process_train_dataset(df_path, needs_scaling=False, needs_encoding=False):
    df = pd.read_csv(df_path)
    if 'label' in df_path:
        df['category'] = df['category'].replace({
            'DDoS': 'DOS_DDOS',
            'DoS': 'DOS_DDOS'
        })
        
    X = df.drop(['is_benign', 'category', 'attack'], axis=1)
    y = df['category']
    
    X_train, X_val, y_train, y_val = train_test_split(X, y, train_size=0.7, random_state=37)
    
    scaler = None
    if needs_scaling:
        scaler = StandardScaler()
        scaler.fit(X_train)
        X_val = pd.DataFrame(scaler.transform(X_val), columns=X_val.columns)
    
    if needs_encoding:
        le = LabelEncoder()
        y_val = le.fit_transform(y_val)
    
    del df, X, y, X_train, y_train

    return X_val, y_val
In [ ]:
import warnings
warnings.filterwarnings('ignore', category=UserWarning)

train_requirements = {
    'Logistic Regression': ('dataset/train_sel_hclust.csv', True, False),
    'Random Forest': ('dataset/train_sel_hclust.csv', False, False),
    'XGBoost': ('dataset/train_sel_hclust.csv', False, True),
    'Neural Network (6 Classes)': ('dataset/train_smote.csv', True, True),
    'Neural Network (5 Classes)': ('dataset/train_labeled.csv', True, True)
}

train_results = defaultdict(list)

for model_name, model in models.items():
    path, needs_scaling, needs_encoding = train_requirements[model_name]
    result = process_train_dataset(path, needs_scaling, needs_encoding)
    
    X_val, y_val = result

    y_pred = model.predict(X_val)
    metrics = {
        'accuracy': accuracy_score(y_val, y_pred),
        'precision': precision_score(y_val, y_pred, average='macro'),
        'recall': recall_score(y_val, y_pred, average='macro'),
        'f1': f1_score(y_val, y_pred, average='macro')
    }
    
    for metric, value in metrics.items():
        train_results[f"{model_name}_{metric}"].append(value)
    
    del X_val, y_val
In [ ]:
plot_model_metrics(train_results)

TEST SET

The function processes the test data in line with each model's training conditions: fitting the StandardScaler on the training data before transforming the test features, merging labels (DDoS/DoS -> DOS_DDOS), and encoding them if needed, as before.

In [ ]:
def process_test_data(test_path, train_path, needs_scaling=False, needs_encoding=False):
    df_test = pd.read_csv(test_path)
    if 'hclust' in train_path  or needs_encoding:
        df_train = pd.read_csv(train_path)
    
    if 'hclust' in train_path:
        df_test = df_test[df_train.columns.intersection(df_test.columns)]
    
    X_test = df_test.drop(['is_benign', 'category', 'attack'], axis=1)
    y_test = df_test['category']
    
    # Handle label transformations for labeled datasets
    if 'label' in train_path:
        y_test = y_test.replace({
            'DDoS': 'DOS_DDOS',
            'DoS': 'DOS_DDOS'
        })
        
    if needs_encoding:
        le = LabelEncoder()
        y_test = le.fit_transform(y_test)
    
    if needs_scaling:
        X_train, _, _, _ = train_test_split(df_train.drop(['is_benign', 'category', 'attack'], axis=1), df_train['category'], train_size=0.7, random_state=37)

        scaler = StandardScaler()
        scaler.fit(X_train)        
        X_test = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)
    
    return X_test, y_test
In [ ]:
import warnings
warnings.filterwarnings('ignore', category=UserWarning)

test_requirements = {
    'Logistic Regression': ('dataset/test_labeled.csv', 'dataset/train_sel_hclust.csv', True, False),
    'Random Forest': ('dataset/test_labeled.csv', 'dataset/train_sel_hclust.csv', False, False),
    'XGBoost': ('dataset/test_labeled.csv', 'dataset/train_sel_hclust.csv', False, True),
    'Neural Network (6 Classes)': ('dataset/test_labeled.csv', 'dataset/train_smote.csv', True, True),
    'Neural Network (5 Classes)': ('dataset/test_labeled.csv', 'dataset/train_labeled.csv', True, True)
}

test_results = defaultdict(list)
for model_name, model in models.items():
    test_path, train_path, needs_scaling, needs_encoding = test_requirements[model_name]
    X_test, y_test = process_test_data(test_path, train_path, needs_scaling, needs_encoding)
    
    y_pred = model.predict(X_test)
    metrics = {
        'accuracy': accuracy_score(y_test, y_pred),
        'precision': precision_score(y_test, y_pred, average='macro', zero_division=0),
        'recall': recall_score(y_test, y_pred, average='macro', zero_division=0),
        'f1': f1_score(y_test, y_pred, average='macro', zero_division=0)
    }
    
    for metric, value in metrics.items():
        test_results[f"{model_name}_{metric}"].append(value)
In [ ]:
plot_model_metrics(test_results, "TEST")

Comparing with the results obtained in the binary classification, we see that logistic regression worsened in the multiclass setting, while our decision-tree-based ensembles maintained very good performance with just a slight decrease, confirming themselves as the strongest models.

The 6-class NN performed poorly, considering that it did worse than logistic regression, a far simpler model; on the other hand, the 5-class NN scores were rather good. It is worth noting that tabular data such as this project's is not the kind of data on which neural networks typically excel.

5. UNSUPERVISED LEARNING¶

5.1 CLUSTERING¶

In [ ]:
df = pd.read_csv('dataset/train_sel_hclust.csv')

Data Preprocessing Functions¶

This code implements preprocessing functions to rebalance and rescale the data, which is beneficial before running clustering algorithms.

In [ ]:
def balance_dataset(df, method='median_multiplier', multiplier=2, column='category'):
    """
    Balance the dataset by downsampling majority classes.
    
    Parameters:
    - df: pandas DataFrame with 'category' or 'attack' column
    - method: str, approach for calculating threshold
    - multiplier: float, multiplier for threshold calculation
    
    Returns:
    - balanced DataFrame
    """
    counts = df[column].value_counts()

    # Calculate threshold
    if method == 'median_multiplier':
        threshold = np.median(counts) * multiplier
    elif method == 'mean_multiplier':
        threshold = np.mean(counts) * multiplier
    elif method == 'quantile':
        threshold = counts.quantile(0.75)
    else:
        raise ValueError("Method must be 'median_multiplier', 'mean_multiplier', or 'quantile'")
    
    # Balance categories
    balanced_dfs = []
    for value in counts.index:
        value_df = df[df[column] == value]
        if len(value_df) > threshold:
            value_df = resample(value_df, replace=False, n_samples=int(threshold), random_state=42)
        balanced_dfs.append(value_df)
    
    balanced_df = pd.concat(balanced_dfs)
    comparison_data = [
        [value, counts[value], balanced_df[column].value_counts().get(value, 0)]
        for value in sorted(counts.index)
    ]

    comparison_data.sort(key=lambda x: x[1], reverse=True)

    print(tabulate(
        comparison_data,
        headers=[column, 'Original', 'After Balance'],
        tablefmt='psql'
    ))
    
    return balanced_df.sample(frac=0.1, random_state=37)

def preprocess_data_cluster(df, **kwargs):
    """Scale and balance data."""

    method = kwargs.get('method', 'median_multiplier')
    multiplier = kwargs.get('multiplier', 2)
    column = kwargs.get('column', 'category')

    df = balance_dataset(df, method=method, multiplier=multiplier, column=column)

    scaler = StandardScaler()
    scaled_df = scaler.fit_transform(df.drop(['is_benign', 'category', 'attack'], axis=1))


    return scaled_df, df[column]

Clustering Functions¶

In this section we run K-Means and DBSCAN clustering, whose results are evaluated with three metrics: the Silhouette, Calinski-Harabasz, and Davies-Bouldin scores.

For K-Means, in line with what was done for the NN, we fit the model with both 5 and 6 clusters, given the similarities between DoS and DDoS attacks.

For DBSCAN, the clustering task is performed over various values of eps (the maximum distance between two samples for them to be considered in the same neighborhood) and min_samples (the minimum number of samples required to form a cluster). Once completed, the best result is kept.

In [ ]:
def kmeans_clustering(data, n_clusters):
    """Perform K-Means clustering and evaluate results."""
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=37) 
    labels = kmeans.fit_predict(data)
    print(f"K-Means Clustering with {n_clusters} clusters completed.")
    return labels, kmeans

# DBSCAN Clustering
def dbscan_clustering(data, labels, eps_values=[0.1, 0.3, 0.5, 0.7, 0.9]):
    """Perform DBSCAN clustering with different parameters and save the best result."""
    best_score = -1
    best_labels = None
    best_eps = None
    best_min_samples = None
    
    # Determine the least frequent category
    label_counts = pd.Series(labels).value_counts()
    least_frequent_count = label_counts.min()

    # Use some divisors of the least frequent count as min_samples values
    min_samples_values = [max(1, least_frequent_count // i) for i in range(1, 6)]

    for eps in eps_values:
        for min_samples in min_samples_values:
            dbscan = DBSCAN(eps=eps, min_samples=min_samples)
            db_labels = dbscan.fit_predict(data)  # do not shadow the true labels
            
            # silhouette_score requires at least 2 distinct cluster labels
            if len(set(db_labels)) < 2:
                continue
            score = silhouette_score(data, db_labels)
            
            if score > best_score:
                best_score = score
                best_labels = db_labels
                best_eps = eps
                best_min_samples = min_samples
    
    print(f"Best DBSCAN Clustering with eps={best_eps}, min_samples={best_min_samples} completed.")
    return best_labels

# Evaluate Clustering
def evaluate_clustering(data, labels):
    """Evaluate clustering using different metrics."""
    silhouette = silhouette_score(data, labels)
    calinski = calinski_harabasz_score(data, labels)
    davies = davies_bouldin_score(data, labels)
    
    return silhouette, calinski, davies

For the visualization phase, we chose t-SNE for dimensionality reduction because the resulting plots are clearer and more readable than those obtained with other techniques, such as PCA.

In [ ]:
def visualize_clusters(data, cluster_labels_list, true_labels, method_names):
    """
    Visualize clustering results alongside true labels using t-SNE.
    Enhanced for better readability and comparison.
    """
    tsne = TSNE(n_components=2, random_state=37)
    reduced_data = tsne.fit_transform(data)
    
    n_methods = len(cluster_labels_list)
    _, axes = plt.subplots(n_methods, 2, figsize=(20, 8*n_methods))
    
    # If only one method, wrap axes in 2D array for consistent indexing
    if n_methods == 1:
        axes = axes.reshape(1, 2)
    
    
    for idx, (method_labels, method_name) in enumerate(zip(cluster_labels_list, method_names)):
        # Plot clustering results
        cluster_scatter = sns.scatterplot(
            x=reduced_data[:, 0],
            y=reduced_data[:, 1],
            hue=method_labels,
            palette='Set3',
            alpha=0.7,
            s=100,  # Larger point size
            ax=axes[idx, 0]
        )
        axes[idx, 0].set_title(f'{method_name}\nClustering Results', fontsize=12, pad=15)
        cluster_scatter.legend(title="Clusters", bbox_to_anchor=(1.02, 1), loc='upper left', title_fontsize=10)
        
        # Plot true labels
        true_scatter = sns.scatterplot(
            x=reduced_data[:, 0],
            y=reduced_data[:, 1],
            hue=true_labels,
            palette='Spectral',
            alpha=0.7,
            s=100,  # Larger point size
            ax=axes[idx, 1]
        )
        axes[idx, 1].set_title('True Labels\nGround Truth', fontsize=12, pad=15)
        true_scatter.legend(title="Categories", bbox_to_anchor=(1.02, 1), loc='upper left', title_fontsize=10)
        
        # Remove axis labels as they're not meaningful in t-SNE space
        axes[idx, 0].set_xlabel('')
        axes[idx, 0].set_ylabel('')
        axes[idx, 1].set_xlabel('')
        axes[idx, 1].set_ylabel('')
        
        # Add grid for better readability
        axes[idx, 0].grid(True, alpha=0.3)
        axes[idx, 1].grid(True, alpha=0.3)

    plt.tight_layout()
    plt.show()
In [ ]:
scaled_data, labels = preprocess_data_cluster(df)
+------------+------------+-----------------+
| category   |   Original |   After Balance |
|------------+------------+-----------------|
| DDoS       |     581986 |          300339 |
| DoS        |     284552 |          284552 |
| BENIGN     |     192732 |          192732 |
| MQTT       |     107607 |          107607 |
| RECON      |      52541 |           52541 |
| SPOOFING   |      16047 |           16047 |
+------------+------------+-----------------+
In [ ]:
# Perform and evaluate K-Means with 6 clusters
kmeans_6_labels, kmeans_6 = kmeans_clustering(scaled_data, n_clusters=6)
kmeans_6_scores = evaluate_clustering(scaled_data, kmeans_6_labels)
K-Means Clustering with 6 clusters completed.
In [ ]:
# Perform and evaluate K-Means with 5 clusters
kmeans_5_labels, kmeans_5 = kmeans_clustering(scaled_data, n_clusters=5)
kmeans_5_scores = evaluate_clustering(scaled_data, kmeans_5_labels)
K-Means Clustering with 5 clusters completed.
In [ ]:
# Perform and evaluate DBSCAN
dbscan_labels = dbscan_clustering(scaled_data, labels)
dbscan_scores = evaluate_clustering(scaled_data, dbscan_labels)
Best DBSCAN Clustering with eps=0.9, min_samples=404 completed.
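The `dbscan_clustering` helper is defined earlier in the notebook; as a rough sketch of how the eps/min_samples search reported above could work, the following grid search keeps the configuration with the best silhouette score (the parameter grid and the silhouette-based selection are assumptions for illustration, not the notebook's exact code):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

def dbscan_grid_search(data, eps_values, min_samples_values):
    """Fit DBSCAN over a small parameter grid and keep the
    configuration with the best silhouette score (noise excluded)."""
    best_labels, best_score, best_params = None, -1.0, None
    for eps in eps_values:
        for min_samples in min_samples_values:
            labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(data)
            mask = labels != -1  # silhouette is undefined for noise points
            if len(np.unique(labels[mask])) < 2:
                continue  # need at least two clusters to score
            score = silhouette_score(data[mask], labels[mask])
            if score > best_score:
                best_labels, best_score, best_params = labels, score, (eps, min_samples)
    print(f"Best DBSCAN Clustering with eps={best_params[0]}, "
          f"min_samples={best_params[1]} completed.")
    return best_labels
```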

Results for each method are summarised in a table reporting the selected metrics for comparison. Likewise, the plots below compare the clustering results obtained by the three methods side by side. Although the clustering methods succeed in separating clusters, they are not effective in recovering the classes needed to detect malicious events.
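`evaluate_clustering` is defined earlier in the notebook; assuming it wraps scikit-learn's three internal validity metrics that appear in the summary table, a minimal sketch would be:

```python
from sklearn.metrics import (silhouette_score,
                             calinski_harabasz_score,
                             davies_bouldin_score)

def evaluate_clustering_sketch(data, labels):
    """Compute the three internal validity metrics used in the summary table.
    Higher is better for Silhouette and Calinski-Harabasz;
    lower is better for Davies-Bouldin."""
    return (silhouette_score(data, labels),
            calinski_harabasz_score(data, labels),
            davies_bouldin_score(data, labels))
```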

In [ ]:
# Display Evaluation Results in a Table
evaluation_table = [
    ["K-Means-6", *kmeans_6_scores],
    ["K-Means-5", *kmeans_5_scores],
    ["DBSCAN", *dbscan_scores]
]

headers = ["Method", "Silhouette Score", "Calinski-Harabasz Score", "Davies-Bouldin Score"]
print("\nClustering Evaluation Summary:")
print(tabulate(evaluation_table, headers=headers, tablefmt="grid"))

# Visualize Clustering Results Side by Side
visualize_clusters(
    scaled_data,
    [kmeans_6_labels, kmeans_5_labels, dbscan_labels],
    labels,
    ["K-Means/6","K-Means/5", f"DBSCAN/{len(dbscan_labels)}"]
)
Clustering Evaluation Summary:
+-----------+--------------------+---------------------------+------------------------+
| Method    |   Silhouette Score |   Calinski-Harabasz Score |   Davies-Bouldin Score |
+===========+====================+===========================+========================+
| K-Means-6 |           0.402097 |                   9673.79 |                1.33808 |
+-----------+--------------------+---------------------------+------------------------+
| K-Means-5 |           0.38851  |                   9900.97 |                1.50973 |
+-----------+--------------------+---------------------------+------------------------+
| DBSCAN    |           0.479441 |                   3776.1  |                1.78247 |
+-----------+--------------------+---------------------------+------------------------+
[Figure: t-SNE plots comparing each method's clusters (left) with the ground-truth categories (right)]

5.2 Anomaly Detection¶

We created a new dataset for anomaly detection, recreating a realistic scenario in which most of the traffic is benign and only a small portion is malicious. For the malicious traffic we drew an equal number of samples from each attack category.

In [146]:
# Load the dataset
df = pd.read_csv('dataset/train_sel_hclust.csv')

# Select benign data
benign_data = df[df['category'] == 'BENIGN'].sample(frac=0.1, random_state=37)

# Define the number of samples per attack type
sample_size = 100

# Loop through each attack type and sample data
attack_data = pd.DataFrame()
attack_types = [col for col in df['category'].unique() if col != 'BENIGN']
for attack_type in attack_types:
    attack_subset = df[df['category'] == attack_type]
    sampled_data = attack_subset.sample(sample_size, random_state=37)
    attack_data = pd.concat([attack_data, sampled_data])

# Combine benign and attack samples
final_data = pd.concat([benign_data, attack_data])

# Save or use the new dataset
final_data.to_csv('dataset/train_anom_detect.csv', index=False)
In [147]:
df = pd.read_csv('dataset/train_anom_detect.csv')
In [148]:
sns.countplot(data=df, x='category');
[Figure: count of samples per category in the anomaly-detection dataset]

Univariate Anomaly Detection¶

As a first method we looked at the outliers of each feature separately, using boxplots. The whiskers of the boxplots extend up to 1.5 times the interquartile range (IQR) above/below the top/bottom of the box; specifically, each whisker is drawn to the largest/smallest observed value that falls within this distance. Points outside that range are considered outliers and are shown on the plot as circles.
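As a small worked example of the whisker rule (toy data, not from the dataset):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 100])
q1, q3 = s.quantile(0.25), s.quantile(0.75)    # 2.0 and 4.0
iqr = q3 - q1                                   # 2.0
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # -1.0 and 7.0
outliers = s[(s < lower) | (s > upper)]         # only the value 100 falls outside
print(outliers.tolist())  # [100]
```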

In [149]:
cols = df.drop(columns=['category','attack','is_benign']).columns
In [150]:
from sklearn.preprocessing import MinMaxScaler

min_max_scaler = MinMaxScaler()

df_minmax = df.copy()  # copy so the original DataFrame is not modified in place
df_minmax[cols] = min_max_scaler.fit_transform(df[cols])
In [151]:
ax = df_minmax[cols].boxplot(figsize=(12, 8))
plt.setp(ax.get_xticklabels(), rotation=90)
ax.grid(False)
plt.show()
[Figure: boxplots of the min-max scaled features]

We now color each outlier by the 'is_benign' boolean feature (green = benign traffic; red = malicious):

In [152]:
# Create the boxplot without showing the fliers (outliers)
ax = df_minmax[cols].boxplot(figsize=(12, 8), showfliers=False)
ax.grid(False)
plt.setp(ax.get_xticklabels(), rotation=90)

# Overlay the outliers with colors based on 'is_benign'
for i, col in enumerate(cols, start=1):  # start=1 to match boxplot position indexing
    data = df_minmax[col]
    q1 = data.quantile(0.25)
    q3 = data.quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr

    # Identify outliers
    outliers = data[(data < lower_bound) | (data > upper_bound)]
    benign_status = df_minmax.loc[outliers.index, 'is_benign']

    # Plot the outliers with color based on 'is_benign'
    plt.scatter([i] * len(outliers), outliers, 
                c=np.where(benign_status, 'green', 'red'),  # Green if benign, red otherwise
                label='Outliers', alpha=0.3, edgecolor='k')

plt.show()
[Figure: boxplots with outliers coloured by 'is_benign' (green = benign, red = malicious)]

What we can conclude:

  • The peculiar distribution of the "IAT" variable makes the boxplot unsuitable for it
  • In general the outliers do not match the benign/malicious split, apart from a few features.

Let's explore those features:

In [153]:
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Bar plot for 'rst_flag_number'
sns.countplot(
    data=df,
    x='rst_flag_number',
    hue='is_benign',
    palette={1: 'green', 0: 'red'},
    ax=axes[0]
)
axes[0].set_xlabel('RST Flag Number')
axes[0].set_ylabel('Count')
axes[0].set_title('Count of RST Flag Number by Is Benign')
axes[0].set_yscale('log')  # Apply logarithmic scale to y-axis

# Show only a subset of x labels for 'rst_flag_number'
axes[0].set_xticks([0])
axes[0].set_xticklabels([0]) 

# Bar plot for 'fin_count'
sns.countplot(
    data=df,
    x='fin_count',
    hue='is_benign',
    palette={1: 'green', 0: 'red'},
    ax=axes[1]
)
axes[1].set_xlabel('FIN Count')
axes[1].set_ylabel('Count')
axes[1].set_title('Count of FIN Count by Is Benign')
axes[1].set_yscale('log')  # Apply logarithmic scale to y-axis

# Show only a subset of x labels for 'fin_count'
axes[1].set_xticks([0]) 
axes[1].set_xticklabels([0]) 

# Adjust layout for better appearance
plt.tight_layout()
plt.show()
[Figure: log-scale count plots of 'rst_flag_number' and 'fin_count' by 'is_benign']
In [154]:
plt.figure(figsize=(10, 6))

palette = {1: 'green', 0: 'red'}

sns.scatterplot(
    data=df,
    x='rst_flag_number',
    y='fin_count',
    hue='is_benign',
    palette=palette,
    alpha=0.5
)

# Add labels and a title
plt.xlabel('RST Flag Number')
plt.ylabel('FIN Count')
plt.title('Scatter Plot of RST Flag Number vs FIN Count')

# Show the legend and the plot
plt.legend(title='Is Benign')
plt.show()
[Figure: scatter plot of 'rst_flag_number' vs 'fin_count', coloured by 'is_benign']

Multivariate Anomaly Detection¶

Let's now try an ML-based approach.

Let's define the main parameter for these anomaly detection models, the 'outlier fraction', used as the expected contamination:

In [155]:
# ratio of malicious to benign samples (approximately the fraction of the
# total, since benign traffic dominates the dataset)
outlier_fraction = len(df[df['is_benign']==0])/len(df[df['is_benign']==1])
print(outlier_fraction)
0.02594302910807866

Elliptic Envelope¶

In [158]:
import warnings
warnings.filterwarnings('ignore', category=UserWarning)

from sklearn.covariance import EllipticEnvelope

y_pred = EllipticEnvelope(contamination=outlier_fraction, random_state=37).fit_predict(df[cols])
In [159]:
sns.countplot(data=df[y_pred == -1], x='category');
[Figure: categories of the samples flagged as anomalous by Elliptic Envelope]

Other Methods¶

In [160]:
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

# Define models to compare
models = {
    "IsolationForest": IsolationForest(contamination=outlier_fraction, random_state=37),
    "LocalOutlierFactor": LocalOutlierFactor(contamination=outlier_fraction, novelty=True),
    "OneClassSVM": OneClassSVM(nu=outlier_fraction),
}

# Prepare the dataset
results = []

# Apply each model
for model_name, model in models.items():
    # Fit and predict anomalies
    if hasattr(model, "fit_predict"):  # IsolationForest and OneClassSVM
        y_pred = model.fit_predict(df[cols])
    else:  # LocalOutlierFactor with novelty=True requires separate fit + predict
        model.fit(df[cols].values)
        y_pred = model.predict(df[cols].values)

    # Convert predictions to DataFrame
    df_results = df.copy()
    df_results["anomaly"] = y_pred
    df_results["model"] = model_name
    
    # Append results
    results.append(df_results)

# Combine results into a single DataFrame
results_df = pd.concat(results)

# Filter anomalies
anomalies = results_df[results_df["anomaly"] == -1]

# Plot anomalies grouped by model
plt.figure(figsize=(12, 6))
sns.countplot(data=anomalies, x="category", hue="model")
plt.title("Anomalies Detected by Different Models")
plt.xlabel("Category")
plt.ylabel("Anomaly Count")
plt.legend(title="Model", loc="upper right")
plt.show()
[Figure: anomaly counts per category for each model]

Overall, we can infer two main things:

  • the malicious traffic is not characterised by extreme values, so it is largely not captured by the outliers
  • consistent with what we pointed out in the visual analysis chapter, our (mostly numeric) features follow atypical distributions that do not allow a meaningful definition of outliers

CONCLUSION¶

Overall, the supervised models proved more effective for our analysis, whereas the unsupervised models brought little benefit to this task. The supervised models achieved remarkable reliability in detecting attacks, excelling both in binary classification and in the more complex multiclass scenarios.